ragchatbot-codebase/docs/course3_script.txt at main · https-deeplearning-ai/ragchatbot-codebase

25 lines (24 loc) · 59.9 KB

Course Title: Advanced Retrieval for AI with Chroma

Course Link: https://www.deeplearning.ai/short-courses/advanced-retrieval-for-ai/

Course Instructor: Anton Troynikov

Lesson 0: Introduction

Lesson Link: https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/kb5oj/introduction

Rag or Retrieval Augmented generation retrieves relevant documents to give contexts and lm. And this makes it much better at answering queries and performing tasks. Many teams are using simple retrieval techniques based on semantic similarity or embeddings, but you learned more sophisticated techniques in this course, which let you do much better than that. A common workflow in Rag is to take your query and embed that, then find the most similar documents, meaning ones with similar embeddings. And that's the context. But the problem with that is that it can tend to find documents that talk about similar topics as a query, but not actually contain the answer. But you can take the initial user query and rewrite. This is called query expansion. Rewrite it to put in more directly related documents. Two key related techniques. One to expand the original query into multiple queries by rewording or rewriting it in different ways. And second, to even guess or hypothesize what the answer might look like. To see if we can find anything in a document collection that looks more like an answer, rather than only generally talking about the topics of the query. I'm delighted the instructor for this course is Anton trying to call. Anton has been one of the innovators driving for the Soviets and retrieval for AI applications. He is co-founder of Chroma, which provides one of the most popular open source vector databases. If you've taken one of our Lansing short courses taught by Harrison Chase, you have very likely use chroma. Thank you Andrew. I'm really excited to be working with you on this course and share what I'm seeing out in the field in terms of what does and doesn't work in Rag deployments. We'll start off the course by doing a quick review of Rag applications. You will then learn about some of the pitfalls of retrieval where simple vector search doesn't do well. Then you'll learn several methods to improve the results. As Andrew mentioned, the first methods use an LM to improve the query itself. Another method, Rerank query, results with help from something called a cross encoder, which takes in a pair of sentences and produces a relevancy score. You'll also learn how to adapt the query embeddings based on user feedback to produce more relevant results. There's a lot of innovation going on in Rag right now. So in the final lesson, we'll also go over some of the cutting edge techniques that aren't mainstream yet and are only just now appearing in research. And I think they will become much more mainstream soon. We'd like to acknowledge some of the folks who have worked on this course from the Chrome team. We'd like to thank Jeff Hoover, Hamad Bashar B Trump and Ben Eggers, as well as Chrome is open source developer community from the Deep Learning team. We have Jeff Lodwick and Mark Gregory. The first lesson starts with an overview of wreck. I hope you go on to watch that right after this. And with these techniques, it turns out, is possible for smaller teams than ever to build effective systems. So after this course, you might be to build something really cool with an approach that previously would have been considered rag tag.

Lesson 1: Overview Of Embeddings Based Retrieval

Lesson Link: https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/ukzj4/overview-of-embeddings-based-retrieval

In this first lesson, we're going to review some of the elements in an embeddings based retrieval system and how that fits together in a retrieval augmented generation loop, together with an LM. So let's go. So the first thing I'd like to show you is the overall system diagram of how this works in practice. The way retrieval augmented generation works is you have some user query that comes in, and you have a set of documents that you've previously embedded and stored in your retrieval system in this case. You take your query, you run your queries through the same embedding model as you use to embed your documents, which generates an embedding. You embed declaring, and then the retrieval system finds the most relevant documents according to the embedding of that query, by finding the nearest neighbor embeddings of those documents. We then return both the query and the relevant documents to the LM, and the LM synthesizes information from the retrieved documents to generate an answer. Let's show how this works in practice. To start with, we're going to pull in some helper functions from our utilities. This function just basically is a basic word wrap function, which allows us to look at the documents in a nicely pretty printed way. And the example that we're going to use, we're going to read from a PDF. So we're going to pull in PDF reader. This is a really simple Python package that you can easily import. It's open source. And we're going to read from Microsoft's 2022 annual report. And so to do that we're going to extract the texts from the report using this PDF reader application. And all we're doing here is for every page that the reader has, we're extracting the text from that page, and we're also stripping the whitespace characters from those pages. Now, the other important thing that we really need to do is make sure that we're not sending any empty strings. There aren't any empty pages that we send to our retrieval system. So we're going to filter out those as well. And this little loop just basically checks to see if there's an empty string. And if there is we don't add it to the final list of PDF texts. And so just to show the sort of output that we get here, we'll print an example. And what we'll do is print the output of the first page of extracted text from this PDF. Here we are. And this is what the PDF reader has extracted as text from the first page of the document. So in our next step we need to chunk of these pages first by character and by token. To do that we're going to grab some useful utilities for link chain. We're going to use some Lang chain text splitters. We're going to use the recursive character text splitter and the sentence transformers token text splitter. It's important that we use the sentence Transformers token text splitter. And I'll explain why in just a moment. But first, let's start with the character splitter. The character splitter allows us to divide text recursively according to certain divider characters. And what that practically means is first, in each presented piece of text, the recursive character text splitter will find the double newlines and split on double newlines, and then if the chunks that got split are still larger than our target chunk size, in this case 1000 characters, it will use the next character to split them, then the next character, then justice space, and finally it will split just on character boundaries itself. Then we've also selected a chunk overlap of zero. This is a hyperparameter that you can play with to decide what optimal chunking looks like for you. So let's go ahead and run this. And we're going to output the output of the character text splitter. We're going to look at the 10th text split chunk that we got. And we're also going to output the total number of chunks that the character splitter gives us. So let's run this cell and take a look at the output. So we see the 10th chunk is all of this text according to the character recursive character text splitter. And this 347 chunks in total from this annual report PDF. So now we split by character. The character text splitting isn't quite enough, and the reason for that is because the embedding model, which we use called sentence transformers, has a limited context window widths. In fact, it uses 256 characters. That's the maximum context window length of our embedding model. This is a minor pitfall if you're not used to working with embeddings, you may not consider the embedding model context window itself. But it's very, very important because typically an embedding model will simply truncate any characters or tokens that are beyond its context window. So to make sure that we're actually capturing all the meaning in each chunk when we go to embed it, it's very important that we also chunk according to the token count. And what we're doing here is we're using the sentence transformer text splitter again with a chunk of overlap of zero. And we're using 256 tokens per chunk, which is the context window length of the sentence transformer embedding model. And I'll go into more detail about that embedding model in a little bit. And we are essentially taking all of the chunks that were generated by the character text splitter. And we are splitting them using the token text splitter. Let's put out similar output to what we had in the last cell and see what we observe here. Should we see a similar chunk? It's a little bit different to what we got before. Obviously it's fewer characters because we have only 256 tokens. This is again the 10th chunk. And we notice that we have a couple more chunks than we had before. In the previous output we had 347 chunks. In this output we have 349. So it's divided a couple of the existing chunks into more pieces. So we have our text chunks. That's the first step in any retrieval augmented generation system. The next step is to load the chunks that we have into our retrieval. And in this case we'll be using chroma. So to use chroma we need to import from itself. And we're going to use the sentence transformer embedding model as promised. Now let's talk a little bit about the sentence transformer embedding model and what this actually means and what an embedding model really actually even is. So the sentence transformer embedding model is essentially an extension of the Bert transformer architecture. The Bert architecture embeds each token individually. So here we have the classifier instruction token. And then I like dogs. Each token receives its own dense vector embedding. What a sentence transformer does is allow you to embed entire sentences or even small documents like we have here. By pooling the output of all the token embeddings to produce a single dense vector or per document, or in our case, per chunk sentence, transformers are great as an embedding model. They're open source, all the weights are available online, and they're really easy to run locally. They come built into chrome, and you can learn more about them by looking up the sentence Transformers website or taking a look at the linked paper. So that's why we're using sentence Transformers. And now hopefully it makes sense why we use the sentence transformer tokenizer text layer. So what we're going to do is we're going to create a sentence transformer embedding function. This is for use with comma. And we're going to demonstrate basically what happens when this embedding function gets called. So that's the output of this. So let's take a look. Now you may get this warning about hugging face tokenizers. This is a minor bug in hugging face. This is nothing to worry about. Perfectly normal. And here's the output that we get. And you can see this is one very, very long vector. It's a dense vector. Every entry in the vector has a number associated with it. And this is the representation of the 10th text chunk that we showed you before as a dense vector. This vector has 358 dimensions, which sounds like a lot unless you consider the full dimensionality of all English text, which is much, much higher. So the next step is to set up chroma. We're going to use the default chroma client, which is great if you're just experimenting in a notebook. And we're going to make a new chroma collection. And the collection is going to be called Microsoft Annual Report 2022. And we're also going to pass in our embedding function, which we defined before, which as you remember is a sentence transformer embedding function. We are going to create IDs for each of the text chunks that we've created. And they're just going to be the string of the number of their position in the total token split texts. And then what we're going to do is we're going to add those documents to our chroma collection and to make sure that everything is is the way we expect. Let's just output the count. After everything has been added. And let's run this cell. So now that we have everything loaded into chroma, let's connect and learn and build a full fledged Rag system. But we're going to demonstrate how querying and retrieval and LMS all work together. So let's start with a pretty simple query. I think if you're reading an annual financial report, one of the top questions you have in mind is what was the total revenue for this year? And what we're going to do is we're going to get some results from chroma by querying it. And we see here that we call query on our collection. We pass our query texts and we're asking for five results. Chroma. Under the hood we'll use the embedding function that you've defined on your collection to automatically embed the query for you. So you don't really have to do anything else to call that embedding function again. And we're going to pull the retrieve documents out of the results. This zero on the end is basically saying give me the results for the zero query. We only have the one query. And what we're going to output now is basically the retrieve documents themselves. And take a look. So let's run the cell. And we see that the documents that we get here are fairly relevant to our query. What was the total revenue we have. Classified revenue by different product and service offerings. We're talking about an unearned revenue. And there's more information in a similar vein. So the next step is to use this results together with an LLM to answer our query. We're going to use GPT for this. And we need to just do a little bit of set up so that we can have an open AI client. We're going to load our OpenAI API key from the environment so that we can authenticate. And we're going to create an OpenAI client. This is using their new version one API where they've wrapped everything in this one nice client object for us. So running the cell there won't be any output here, but everything's ready to go. Now we're going to define a function that allows us to call out to the model using our retrieved results, along with our query. We're going to use GPT 3.5 turbo, which does a reasonably good job in rack loops and is fairly quick and fast. So the first thing is we're going to pass in our query and retrieve documents. We're going to just join our retrieve documents into a single string called information. And we're gonna use the double new line to do so we're going to set up some messages. So the first thing is the system prompt. The system prompt essentially defines how the model should behave in response to your input. And here we're saying you are helpful expert my natural research assistant. Your users are asking questions about information contained in an annual report. You'll be shown the user's question and the relevant information from the annual report. Answer the user's question using only this information. So what this is doing, and this is really the core of the entire loop. We're turning GPT from a model that remembers facts into a model that processes information. That's the system prompt. And now we're going to add another piece of the message for our user content. And we have here that we're in the role of the user. And here's the content. The content is essentially a format of string that says here's our question. And that's just our original query. Here's the information you're supposed to use. Here's the information. Then we need to send the request to the OpenAI client, which is just using the normal API from the client. There's nothing special here at all. We're specifying the model. We're sending the messages. We're basically calling the chat completion endpoint on the OpenAI client, specifying a model and the messages we'd like to send and getting the response back. And then we need to do a little bit more just to unpack the response from what the client returns. So we have defined our function and now let's actually use it. Let's put everything together. So here's what we're going to do. We are going to say output is equal to calling read without query and retrieve documents. Then we're just going to print the word wrapped output. And far away. Finally there we go. The total revenue for the year ended June 30th, 2022 was $198,270 million for Microsoft. Microsoft are doing pretty well now. It's a good time to take a moment and try some of your own queries. So remember we specified the query a little bit further up. What was the total revenue? Try some of your own and see what the model outputs based on the retrieved results from the annual report. I think it's actually really important to play with your retrieval system to gain intuition about what the model and the retriever can and can't do together. Before we dive into really analyzing how the system works in the next lab, we're going to talk about some of the pitfalls in common failure modes of using retrieval in a retrieval augmented generation loop.

Lesson 2: Pitfalls Of Retrieval - When Simple Vector Search Fails

Lesson Link: https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/s49c1/pitfalls-of-retrieval---when-simple-vector-search-fails

In this lesson, we're going to learn a little bit about some of the pitfalls of retrieval with vectors. I want to show you some cases where simple vector search really isn't enough to make retrieval work for your AI application. Just because things are semantically close as vectors under a particular embedding model doesn't always mean you're going to get good results right out of the box. Let's take a look. First thing we need to do is just get set up. Our helper utilities this time will let us load up everything we need to load from chroma, have the right embedding function ready to go, and we're just going to do a little bit of setup. So again we're going to create the same embedding function. And we're going to use our helper function this time to load our chroma collection. And we're just going to output the count to make sure we've got the right number of vectors in there. And again don't worry about any of these warnings you might see. So yep, that's the right output through us 349 chunks embedded in chroma. So one thing that I personally find useful is to visualize the embedding space. Remember that embeddings and their vectors are a geometric data structure and you can reason about them spatially. Obviously embeddings are very high dimensional. Sentence transformer embeddings have 348 dimensions like we talked about, but we can project them down into two dimensions which humans can visualize. And this can be useful for reasoning about the structure of embedding space. To do that, we're going to use something called Umap. U map is uniform manifold approximation. And it's an open source library that you can use exactly for projecting high dimensional data down into two dimensions or three dimensions, so that you can visualize it. This is a similar technique to something like PCA or t-SNE, except you map explicitly tries to preserve the structure of the data in terms of the distances between points as much as it can. Unlike, for example, PCA, which just tries to find the dominant directions and project data down in that way. So we're going to import Umap and we'll grab numpy and we'll grab tkm. If you don't know what to cdms, it's a little thing that basically shows you a percentage bar, when you have some long running process. I like to use this so that I know how long the iterations are taking and how much longer I might be waiting. And we're going to grab all of the embeddings out of the Corona collection. And what we're going to do is we're going to fit a U map transform. So again, you map is basically a model which fits a manifold to your data to projected down into two dimensions. We're setting the random seed to zero here just so that we can get reproducible results. And we get the same projection every time. So let's go ahead and fit that transform. And again don't worry about any warnings you might get here. Now in this next step, now that we fitted the transform we're going to use the transform to project the embeddings. And we're going to define a function that does that. We're going to call it project embeddings. And it takes as input an array of embeddings. And it takes the transform itself. And we're going to start by declaring an empty array empty numpy array, of the same length of as our embeddings array, but with dimension two, because we're just going to get two dimensional projections out. And what we're going to do is we're going to project the embeddings one by one. The reason to do it one by one is just so that we get consistent behavior from you. Map the way that you map does. Projection is somewhat sensitive to its inputs. So to ensure that we have reproducible results, we're just going to project one at a time instead of in batches. And then of course we're just going to return the result of the function just the way that you would expect having defined the function. Let's run it on our data set. And this will take a minute. Great. So now that process is finished, let's project the embeddings and actually take a look at them. So we're going to grab matplotlib. And probably most of you are fairly familiar with matplotlib I know. We're going to make a figure. We're just going to do a scatter plot, of the projected embeddings. Now. So you can see we have predicted data set embeddings, the first element of each one and the second element of each one. And we're going to make them size ten just because it's visually pleasing. We're going to set some other properties of our axes. And there we go. And this is what our data set looks like inside chroma projected down to two dimensions. And you can see that we preserve some structure a little bit more advanced visualization would allow you to sort of hover over each of each of these dots and see what's actually in there, and you would see that things with similar meanings end up next to each other, even in the projection. Sometimes they're a little bit unusual structures, because a two dimensional projection cannot represent all of the structure of the higher dimensional space. But as I said, it is useful for visualization. And one thing that it's useful for is to bring your own thinking into a more sort of geometric setting and actually think about vectors and points, which is what embedding space retrieval is really all about. So what evaluating the quality and performance of a retrieval system is all about is actually relevancy and distraction. So let's take a look at our original query again the one that we used in our example. What's the total revenue. And we're going to do just the same thing as we did last time. We're going to query the chroma collection using this query. As for five results, then we're going to include the documents and the embeddings because we'd like to use those embeddings for visualization. And so we're going to grab our retrieve documents out of the results again. And let's print them out. And we see again the same results as we saw before. Retrieval is deterministic in this case. And we see that there are several revenue related documents. But also there are things that here that are might, you know, might not be directly related to revenue. And we see things like potentially costs, things that are to do with money, but not necessarily revenue. So let's take a look at how this query actually looks when visualized. So what we're going to do is grab the embedding for our query using the embedding function. And we're going to grab our retrieved embeddings as well which we get from our result. And what we're going to do is use our projection function to project both of these down to two dimensions. And then now that we've got the projections, we can visualize them and we can visualize them against the projection of the data set. I'll just copy paste this in. But it's again a scatterplot of the data set embeddings of the query embedding and of the retrieved embedding. And we're going to set the query embedding to be a read x. And we're going to see the selected or retrieved embeddings as empty circles which are green. So let's go ahead and see what that looks like. And here we are. So this is a visualization of the query and the retrieved embeddings. You can see the query here is this red x and the the green circles basically circle those data points that we actually end up retrieving. Notice that it doesn't look in the projection like these are the actually nearest neighbors. But remember we're trying to squash down many, many higher dimensions into this two dimensional representation. So it's not always going to be perfect. But the important thing is to basically look at the structure of these results. So you can see some are more outlier than others. Right. And this is actually the heart of the entire issue, the embedding model that we use to embed queries and embed our data does not have any knowledge of the task or query we're trying to answer at the time we actually retrieve the information. So the reason that a retrieval system may not perform the way that we expect is because we're asking it to perform a specific task using only a general representation, and that makes things more complicated. Let's try visualizing a couple other queries in a similar way. So here I'm just going to copy paste the whole thing. But the query now is what's the strategy around artificial intelligence that is AI. So let's run and see what results we get. And you see here that AI is mentioned in most of these documents. And this is sort of vaguely related to AI. We have a commitment to responsible AI development. But then we have, you know, something about this, you know, information about a database which is not directly related to AI. And, you know, here we're talking about mixed reality applications and metaverse, which is, you know, tangentially related to, technology investments, but not necessarily directly AI related. So let's visualize, first of all, project the same way as we did in previous query. And then we will plot. Let's take a look. Here's our query and our related results. And they're all coming from the same part of the data set. But you can see that some of the results that we get, you know, and here this this point appears to be bang on where our query landed. So it's super, super relevant. So you can see that obviously this where the query lands in this space has geometric meaning. And we're pulling in related results. But again what's related is from the general purpose embedding model not from the specific tasks that we're performing. So let's take a look at another query. What has been the investment in research and development. And this is a very general query. And it should be reflected in the annual statement. So let's see what kind of documents we get back. We see that we start with, you know, general ideas about investments. Some of it is about research and development. For example, this document research and development expenses can include a third party development and programing costs. But we see that there are also distractors in this results. So a distractor is a result that is not actually relevant to the query. And it's called a distractor. Because if you pass this information to the large language model to complete your loop, the model tends to get distracted by this information and outputs suboptimal results. And the reason this is really important is that bad behavior from the model due to distractors is very difficult to diagnose and debug, both for the user, but also for developers and engineers building these types of systems. So it's very important to make your retrieval system robust and return relevant results and no distracting results to the model. So again, let's take a look at the projection. I always find it very, very helpful to visualize and again because this this is a geometric type of data. I find visualization is a great way to develop intuitions. So there's our projection and let's plot it. So here we see the results that we're getting are a lot more spread out. And the way you can imagine this is imagine all your data is a cloud of points sitting in this high dimensional space. A query that lands inside the cloud is likely to find nearest neighbors that are sort of densely packed and close together inside the cloud, but a query that lands outside the cloud is likely to find nearest neighbors from a lot of different parts, of that cloud. So they tend to be more spread out. So geometrical intuition. So finally, I think it's really important to understand what happens when we put an irrelevant query, into our retrieval system. So let's find out what Michael Jordan has done for us lately in terms of the Microsoft annual report from 2022. Obviously, this is I would be very surprised if this was at all a relevant query. And when we look at the results, of course, none of them have anything to do with Michael Jordan. This doesn't mention him at all, and neither do any of these documents. Neither do any of these results. And that's what we should expect. But remember, if we're using a retrieval system as part of a loop, you're guaranteed to return the nearest neighbors. In this case, your context window is going to be made up entirely of distractor, which, as I mentioned earlier, can be very, very difficult to understand and debug from the application user's perspective and from the application developers perspective. So we need a way to deal with irrelevant queries as well as irrelevant results. And again, let's take a look at the projection. Let's see if there's some something we can understand. Great. We've projected and let's put. It you can see that the results about Michael Jordan are really all over the place, which I guess shouldn't surprise us given that the query is totally irrelevant to any of the data that we have in our data set. Try visualizing some of your own queries and the way that we've done here, and see how they influence the structure, of the returned results. See if you can get queries to land in different parts of the data set and see what the return results say about the information that might be contained in that part of it. In this lab, you've learned how a simple embedding space retrieval system might return distracting or irrelevant results, even for simple queries. And you've learned how to visualize this data so you can gain some intuition about why and how the results are being returned. In the next lab, we'll show you some techniques to basically improve the quality of your queries using loops by using a technique called query expansion.

Lesson 3: Query Expansion

Lesson Link: https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/cwewy/query-expansion

The field of information retrieval has been around for a while as a subfield of natural language processing, and there's many approaches to improving the relevancy of query results. But what's new is we have powerful large language models, and we can use those to augment and enhance the queries that we send to our vector based retrieval system to get better results. Let's take a look. So the first type of query expansion we're going to talk about is called expansion with generated answers. Typically the way that this works is you take your query and you pass it over to an LM, which you prompt to generate a hypothetical or imagined answer to your query. And then you concatenate your query with the imagined answer and use that as the new query, which you pass to your retrieval system or a your database. Then you return your query results as normal. Let's take a look at how this works in practice. So the first thing that we're going to do is once again grab all the utilities that we need. And we're going to load everything we need from call and create our embedding function. And we're going to set up our OpenAI client again because we'll be using the LLN. And once again to help with visualization, we're going to use your map and project our data set so that that's all ready to go for us. Now that we're done setting up, let's take a look at expansion with generated answers. And there's a reference here to the paper which demonstrates some of the empirical results that you can get by applying this method. So to do expansion with generated answers we're going to use an LM in this case GPT. And just the same as last time we're going to prompt the model in a particular way. Let's create this function called augment query generated. And we're going to pass in a query. We're also going to pass in a model argument in this case GPT 3.5 turbo by default. And we're going to prompt the model. And in the system prompt we're going to say you're a helpful expert financial research assistant. Provide an example answer to the given question that might be found in a document like an annual report. In other words, we're pretty much asking the model to hallucinate, but we're going to use that hallucination for something useful. And in the user prompt, we're just going to pass the query as the content. And then we'll do our usual unpacking of the response. And that defines how we're going to prompt our model. Let's wire this together. Here's our original query asking was there a significant turnover in the executive team? We will generate a hypothetical answer and then we'll create our joint query, which is basically the original query prepending to a hypothetical answer. Let's take a look at what this actually looks like after we generate it. So here we see the output. We see our original query. Was there a significant turnover in the executive team and a hypothetical answer. In the past fiscal year, there was no significant turnover in the executive team. The core members of the executive team remain unchanged, etc. so let's send this query plus the hypothetical response to our retrieval system as a query, and we'll query the Croma collection the usual way and print out our results. And we're sending the joint query as the query to our retrieval system. And we're retrieving the documents and the embeddings again. So these are the documents that we get back. We see things here discussing leadership. We see how consultants and directors work together. Here we have sort of an overview of the different directors that we had in Microsoft. And we talk about the different board committees. Let's visualize this. Let's see what sort of difference this made. So to that we get our retrieved embeddings. We get the embedding for our original query. We get the embedding for our joint query. And then we project all three and plotting the projection. And we see the red is our original query. The orange box is our new query with the hypothetical answer. And we see that we get this nice cluster of results. And but most importantly, what I want to illustrate here is that using the hypothetical answer moves our query elsewhere in space, hopefully producing better results for us. So that was query expansion with the generated queries. But there's another type of query expression we can also try. This is called query expansion with multiple queries. And the way that you use this is to use the alum to generate additional queries. That might help answering the question. So what you do here is you take your original query, you pass it to the Elm. You ask the Elm to generate several new related queries to the same original query, and then you pass those new queries along with your original query to the vector database or your retrieval system that gives you results for the original and the new queries. And then you pass all of those results to the LLM to complete the loop. So let's take a look at how this works in practice. Once again the starting point is a prompt to the model. And we see here that we have a system prompt. And the system prompt is a bit more detailed. This time we take in a query which is our original query. And we ask the model. It's a helpful expert financial research assistant. The users are asking questions about an annual report. So this gives the model enough context to know what sorts of queries generate. And then you say suggest up to five additional related questions to help them find the information they need for the provided question. Suggest only short questions without compound sentences. And this. Make sure that we get simple queries. So just a variety of questions that cover different aspects of the topic. And this is very important because there are many ways to rephrase the same query. But what we're actually asking for is different. But related queries. And finally, we want to make sure that they're complete questions. They're related to the original question. And we ask some formatting output. One important thing to understand about these techniques in particular that bring an LLM into the loop of retrieval is prompt engineering becomes a concern. It's something that you have to think about. And I really recommend that you as a student play with these prompts. Once you have a chance to try to lab, see how they may change, see what different types of queries you can get the models to generate. Try different models and basically experiment not just with the retrieval system, but with the prompts you're using to augment your queries. So let's define this function and let's see what we get when we actually try this. So here's our original query. What were the most important factors that contributed to increases in revenue. So to say this is a compound query. This is you know, there could be many, many different factors. And it's not just about revenue. And let's see what augmented queries we get back. Last model and let's print a set of augmented queries as output. Great. So we got a few back. We see that. What were the most important factors that contributed to decreases in revenue? Great question. What were the sources of revenue? Also very important. How were sales and revenue distributed across the different product lines? Were there any changes in pricing strategy? Did the company acquire any new customers? So you can see that these are related questions to our original query. But they're not precisely the same and they have different meanings. That's very, very useful. And that's a great illustration of augmenting an original query through query expansion with multiple queries. So let's see how this works in practice. Once we pass these queries to our retrieval system. So first we build our set of queries. Now Croma can handle multiple queries in parallel. So what we're doing here is taking our original query in an array. And then concatenating that with our array of augmented queries. So now we have one array where each entry is a query, our original query plus the augmented queries. And we can grab the results. And again from a can do querying in batches. And let's look at the retrieve documents that we get. And one thing that's important here is because the queries are related. You might get the same document retrieved for more than one query. So what we need to do is to duplicate the retrieved documents. And that's what we do here. And finally let's just output the document so we get. So we can see now the documents that we got for each query. And these are all to do with revenue different aspects of revenue growth which is exactly what we were hoping for. We have increases in windows revenue. We can see things that are coming from other components. So for example, what were the most important factors that contributed to decreases in revenue. So we see increased sales and marketing expenses, different types of investments, different types of tax breaks. Essentially each of these augmented queries are providing us with a slightly different set of results. And let's visualize that. What did we actually get in geometric space in response to these results. So again, we'll take our original query embedding and our augmented query embeddings and project them. And the next thing we'll do is project that result embeddings. Before we do that we need to flatten the list because we have a list of embeddings per query. We just want the flat list of and returned embeddings. And then we just project them as before. Let's visualize what we get and we see that using query expansion, we're able to actually hit other related parts of the data set that our single original query may not have reached. And this gives us more of a chance to find all of the related information, especially in the in the context of more complex queries, which require more and different types of information to answer. So here we see that the read x is our original query, the orange XS are the augmented, the new queries generated for us by the alum and I. Once again, the green circles represent the results that we actually returned by the retrieval system to the model. One way to think about this is that a single query turns into a single point in embedding space, and a single point in embedding space likely doesn't contain all of the information that you need to answer. More complex query like this one. So using this form of query expansion where we generate multiple related queries using an LLM gives us a better chance of capturing all of the related information. The downside of this, of course, is now we have a lot more results than we had originally, and we're not sure if and which of these results are actually relevant to our query. In the next lab, using cross encoder Reranking, we have a technique that allows us to actually score the relevancy of all the returned results and use only the ones we feel match our original query. And I'll demonstrate that in the next lab. In this lab, I recommend that you try playing around with the query expansion prompts. Try your own queries and see the types of results you get by asking different types of questions about the Microsoft Annual report.

Lesson 4: Cross Encoder Re Ranking

Lesson Link: https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/nusf7/cross-encoder-re-ranking

In the last lesson, we looked at how to improve retrieval results by augmenting the query we sent with an LM. In this lesson, we're going to use a technique called cross encoder reranking to score the relevancy or about retrieve results for the query that we sent. Let's dig in Reranking as a way to order results and score them according to their relevancy to a particular query. So let's take a look at how this works underneath in Reranking. After you retrieve results for a particular query, you pass these results along with your query to a Reranking model. This allows you to rerank the output so the most relevant results have the highest rank. Another way to think about this is your Reranking model scores. Each of the results conditioned on the query, and those with the highest score are the most relevant. Then you can just select the top ranking results as the most relevant to your particular query. So let's take a look at how to do this in practice. First we import our helper functions as before. And we load the data into. So one use of Reranking is to get more information out of the long tail of query results. So let's take a look at this query that we've already covered once before, which is what has been the investment in research and development. And usually we've been asking for five results return for our particular query. But now we're going to ask for ten. That means we're going to get a longer tail of possibly useful results. And again we're going to include documents and embeddings. So let's retrieve the documents and and take a look at what we get. We see that we get the same first five results as before because retrieval is deterministic. But we also have five new results which might have relevant information to our question. The trick is to figure out which of these results are actually relevant to our specific query, instead of just being the nearest neighbors in embedding space. And the way we do that is through using cross encoder Reranking. So we're going to use the sentence transformer cross encoder. And we're going to instantiate it with a particular model. So what is a cross encoder model. Sentence transformers are made up of two kinds of models. There's something called a BI encoder where a BI encoder encodes queries separately. And then we can use the output of those bi encoders to perform cosine similarity and find the nearest neighbors. In contrast, a Bert cross encoder takes both our query and our document and passes it through a classifier which outputs a score. And in this way, we can use our cross encoder to score our retrieve results by passing our query and each retrieve document and scoring them using the cross encoder. We can use the cross encoder by passing in the original query and each one of the retrieved documents, and using the resulting score as a relevancy or ranking score for our retrieved results. So we've instantiated our cross encoder, and the first thing we're going to do is create pairs. The pairs consist of our query and each doc in our retrieve documents. And we're just going to ask the cross encoder to score each pair. So let's print out our scores. And while we see that the first two documents have high scores for our query, we notice first of all, that the second retrieve document is actually a much higher score than our first one. And also some documents in the longer tail of retrieved results have higher scores than some of the documents in the first five. So what would that look like if we were to reorder our documents according to score? We see that the second document is now ranked first. First document is ranked second. And something in the long tail actually makes it into the top five. And in fact, the top five ranked results contain results that originally the sixth and seventh results, while the fourth and fifth results are actually ranked by lower. So in this way, we've used the cross encoder and the score that it produces to rerank our results. And now if we were to cut to this top five, we'd see that the results should be much more relevant than what we had before, because we've mined more of the long tail for information that's actually relevant to our question. Now, you might already be see where I'm going with this, but given the number of results that we get with query expansion and the way in which each generated query addresses a different part of the complex question, we can use the crushing further reranking technique to actually get all of the best results for the original query from the augmented expanded queries, instead of just sending all of them to the left. And here's how we do that. So from the previous lab, this is just our original query. And the generated queries. And I've saved them into text here for you. And then we do the same thing. We concatenate the original and generated queries together and we retrieve the results. And then as last time we did duplicate the retrieved results. And now we create Paris, just as we did in the previous example, where we make pairs of the original query and each retrieve document. In this way, we can compute the relevance of the retrieved results for the augmented queries to the original query and select among them the five best that we actually want to pass to the left. So let's create those pairs. And let's score them. And one great thing about using a cross encoder model like this one is it's extremely lightweight and runs completely locally. So here are the scores of all of our retrieved results. And we can use these scores to order our results and give us a new ordering. And then we can pass the top five of this new results to the LLM and get the most relevant information from this long tail of queries that we got from query expansion and the retrieved results for our augmented queries. So in this lab, we learned how to use a cross encoder as a Reranking model. And we've seen how we can apply Reranking both to get more out of the long tail of a single query, as well as to filter the results of an augmented expanded query to only the results relevant to the original query itself. This is a really powerful technique and worth experimenting with some more, and it's a good idea to try to understand and get an intuition for how the reranking score might change depending on query, even when the result that your retrieval system is giving you are the same. This is because the cross encoder rear anchor can emphasize different parts of the query than the embedding model does. And so the ranking that it provides is much more conditional on the specific query than just what is naively returned by the retrieval system itself. In the next lab, we'll talk about query adapters. Query adapters are a way to directly alter or augment the query, embedding itself using user feedback or other types of data to get better query results.

Lesson 5: Embedding Adaptors

Lesson Link: https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/s5dr4/embedding-adaptors

In the last couple of lessons, we've looked at how we can use query augmentation and cross encoder reranking to improve retrieval results. In this lesson, I'm going to show you how we can use user feedback about the relevancy of retrieved results to automatically improve the performance of the retrieval system using a technique called embedding adapters. Let me show you how it works. Getting adapters are a way to alter the embedding of a query directly in order to produce better retrieval results. In effect, we insert an additional stage in the retrieval system called the embedding adapter, which happens after the embedding model. But before we retrieve the most relevant results, we train the embedding adapter using user feedback on the relevancy of our retrieved results for a set of queries. Let's get into it. So the first thing we do is grab our helper functions as before. One special thing here is we're going to need torch because we're going to effectively train a model, but a very lightweight one. And we create our embedding function and load everything into. And again we project all of our data. So the first thing we need for this approach is a data set. We don't have one ready because we haven't really had users using our application. But we can use a model to generate a data set for us. And once again, this is all about creating right prompt. So we're going to use GPT again. And essentially we're prompting the model as an expert helpful financial research assistant. And it should suggest 10 to 15 short questions that are important to ask when analyzing an annual report and with some guidelines about what the output should be like. And this will generate some queries that users might actually have run against our system. So let's ask how to generate this query. We see that these are fairly reasonable questions to ask about any company's financial statements. So we're going to get the results from Croma and what we're going to do. And we'll get the retrieve documents associated with the results. And what we're going to do is also ask the model to evaluate the results. In a real system, you can easily ask our users to give a thumbs up or thumbs down on the generated output, and then reference that with the retrieve results to give a signal about what results were actually relevant and which ones. In this case, we don't quite have that, but we can use a model to evaluate the relevancy of the retrieved results for each query. And again, this is just about prompting the model. So we're going to ask our helpful expert financial assistant to tell us whether a given statement is relevant to the given query. And we're going to ask it to output only yes or no. And then we're going to essentially transform yeses to ones and those two negative ones. And I'll explain why in just a minute. That's the prompt. And then what we're going to do is we are going to get our retrieved embeddings and our query embeddings. And we're going to start making a data set to train our embedding adapter. And the way we're going to do it is like this. We're going to have our doctor query embeddings, a doctor doc embeddings and our adapter labels. Now the adapter prefix just means we're going to use this in a data set. They're not they're not special in any way. They're just the embeddings of our queries and the embeddings for our documents. The labels we're going to get from our evaluation model. The label is going to be plus 1 or -1, depending on whether the document is relevant or not to the given query. So we're just going to loop over everything to create these triples. So the model is performing an evaluation for us. Now it's no mistake that our labels are plus one and minus one. Because what we're going to do when we're training our embedding adopter model is use these values as our loss function for cosine distance when two vectors are identical. The cosine similarity between them is one. When two vectors are opposite, the cosine similarity between them is negative one. In other words, we want relevant results to point in the same direction as vectors, and we want irrelevant results to point in the opposite direction from a given query. And this is the model that we're going to train. That's exactly what it's going to try to do. All right let's check out the length of our data set. Great. 150. So that's 15 queries with ten results each for each one labeled for relevancy. So the next thing we need to do because we're using torch to train our embedding adapter, is we need to transform our data set into a torch tensor data set. So we're just going to do some data manipulation here to transform these into torch tensor types. And finally we're going to pack everything into a torch data set. So let's set up our embedding. Adapt our model. The first thing is to set up the model itself. And the model is fairly straightforward. It takes as input a query embedding, a document embedding and an adapter matrix. We compute an updated query embedding by multiplying our original query embedding by the adapter matrix, and then we compute the cosine similarity between our updated query embedding and our document embedding. Next, let's define our loss function. Again our loss takes a query embedding, document embedding adapter matrix and label. And we run the model to compute the cosine similarity. And we compute the mean squared error between the cosine similarity and the label. And you'll notice again that the plus one label means that the cosine similarity says that the vectors are pointing in the same direction, and a negative one label means they should be pointing in the opposite direction. In this way, we want our queries to be pointing in the same direction as relevant documents and in the opposite direction to irrelevant documents. And this is what we're training our adapter matrix to do. We initialize our adapter matrix for training. You might recognize this is very similar to a linear layer, in a traditional neural network. And that's really all we're doing. Next let's set up our training loop. We set our minimum loss and our best matrix as things to keep track of. Let's train for 100 epochs for each query embedding document embedding a label in our Torch data set. We compute our loss if the loss that we computed is better than our previous loss. We'll keep track of. That is the best matrix so far. And then we backpropagate. And let's run our training loop. And you can see it's very very fast because again this is exactly the same thing is trained as if we were training a single linear layer of a traditional neural network. So let's take a look at the best loss that we got. This is pretty good. A loss of 0.5 is pretty good. It means we've got pretty much a halfway improvement in terms of where we started from. So one thing we'd like to take a look at is how the adapter matrix influences our query vector. To do that, we can construct a test vector consisting of all the ones, and we can multiply that test vector by our best matrix. And what this will tell us is which dimensions of our vectors get scaled by what amount. You can think of an embedding adapter, a stretching space, and squeezing space for the dimensions which are most relevant to the particular queries that we have, while reducing the dimensions that are not relevant to our query. You'll also notice that it can reverse dimensions. So let's plot what that looks like. And here you can see how each dimension of our test vector, which consists only of ones, has been stretched and squeezed. Some have been elongated a lot, while others have been made to be almost zero. And so what this means is our embedding adopter has basically decided, okay, these dimensions are more relevant. These are less relevant. These are actually opposite to the things that we want to find. And these things are actually more relevant to the things that we want to find. Now let's take a look at what effect this actually has on our queries. So let's do as we did before. We'll take our generated queries and embed them. And let's compute our also our adapted query embeddings. And then we project them. Now let's plot what we get against our dataset. And as you can see our original queries were quite scattered around. But our new queries concentrate on a certain part of the data set which is most relevant to our queries. You can see how the read queries have been adapted through the embedding out there to transform them into the green queries, to push them into particular part of the space. So, as you can see, an embedding adapter is a simple but powerful technique for customizing query embeddings to your specific application. In order to make this work. You need to collect a data set either a synthetic one like the one we've generated here, or else one that's based on user data. User data usually works best because it actually means that people are using your retrieval system for their specific tasks. Again, because this approach involves prompting and because it involves the use of a large language model, it's worth experimenting with the prompts. And it's also worth experimenting with different initializations of the adapter matrix. Even maybe consider using a full lightweight neural network and training that instead of a simple matrix instead. You might want to tune the hyperparameters of the embedding, adapt their training process, or you might want to collect more specific data and try this out with, a specific application in mind, rather than our very general one of trying to understand a financial statement. In the next lesson, we'll cover some other techniques which are just now emerging from research to improve embedding based retrieval systems.

Lesson 6: Other Techniques

Lesson Link: https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/l1uaj/other-techniques

Embeddings. Based retrieval is still a very active area of research, and there's a lot of other techniques that you should be aware of. For example, you can fine tune the embedding model directly using the same type of data as we used in the embeddings adapters lab. Additionally, recently there's been some really good results published in fine tuning the limits self to expect retrieved results and reason about them. You can see some of the papers highlighted here. Additionally, you could experiment with a more complicated embedding adapter model using a full blown neural network or even a transformer layer. Similarly, you can use a more complex relevance modeling model rather than just using the cross encoder as we described in the lab. And finally, an often overlooked piece is that the quality of retrieve results often depends on the way that your data is chunked before it's stored in the retrieval system itself. There's a lot of experimentation going on right now about using deep models, including Transformers for optimal and intelligent chunking. And that wraps up the course. In this course, we covered the basics of retrieval, augmented generation using embedding space retrieval. We looked at how we can use our LMS to augment and enhance our queries to produce better retrieval results. We looked at how we can use a cross encoder model for Reranking to score the retrieved results for relevancy, and we looked at how we can train an embedding adapter using data from human feedback about relevancy to improve our query results. Finally, we covered some of the most exciting work that's ongoing right now in the research literature around improving retrieval for AI applications. Thanks for joining the course, and we're really looking forward to seeing what you built.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

course3_script.txt

Latest commit

History

course3_script.txt

File metadata and controls