{"metadata":{"bundle_type":"episode_pack","bundle_version":"prompt24_v1","workspace_slug":"orbital","episode_id":"3be321c8-d6db-4ca5-8af8-1130c050e91f","exported_at":"2026-07-01T19:36:00.821883Z"},"summary":{"content_asset_count":0,"transcript_segment_count":200,"asset_types":{},"ranked_theme_ids":[],"theme_snapshot_ids":[]},"episode":{"id":"3be321c8-d6db-4ca5-8af8-1130c050e91f","source_id":"06ffb599-3006-40fa-911f-d6276f0fab54","source_slug":"yt-WRU7-4bpZkg-d9654309","transcript_document_id":"e93568f5-7c0c-4cc6-bef2-78a2560c42c2","raw_asset_id":"d3c58418-68a1-4440-8aa8-bfda0fbab021","title":"How to build a continuous evaluation pipeline for multi-agent systems with Gemini","description":"Check out the codelab discussed in this episode here. → https://goo.gle/prai-rs-2 Stop guessing whether AI agents are actually working. Manual vibe checks might be fine for prototyping, but production requires real data. On the next Google Cloud Live stream, the team is showing developers how to move from subjective testing to data driven assessment using Gemini Enterprise Agent Platform Pipelines and Cloud Run functions. Join Vlad Kolesnikov and Leonid Yankulin to learn how to build an automated regression testing pipeline for your distributed multi-agent systems. Watch along and learn: * Data driven assessment: Implement adaptive rubrics and tool use quality metrics to rigorously evaluate AI agents. * Shadow deployments: Safely deploy AI agents to a private tagged revision in Cloud Run. * CI/CD automation: Integrate continuous evaluation into pipelines to ensure code changes never degrade any agent's proven quality. 
Don't let invisible regressions break your production workflows.","external_url":"https://www.youtube.com/watch?v=WRU7-4bpZkg","status":"published","published_at":"2026-06-30T17:27:56Z","transcript_segment_count":200,"content_asset_count":0,"details_json":{"file_name":null,"published_at":"2026-06-30T17:27:56+00:00","transcript_format":"youtube_captions"},"latest_transcript_segments":[{"id":"b4432dc6-f68c-41c0-b7ac-37f443fc2ccc","segment_index":0,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"[music]"},{"id":"9ebe4eb9-22f4-4bd1-803f-5d580601ed78","segment_index":1,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"[music]"},{"id":"e03d98f7-5911-4433-a91c-9a4d461e50bd","segment_index":2,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"We are looking at a 5,000 share order."},{"id":"83de807e-9e98-4d17-abdc-1043dd8bea87","segment_index":3,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"[music]"},{"id":"4cd7a733-30e7-4645-8767-cd72bc45e3c0","segment_index":4,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Hello and welcome back to another"},{"id":"f929527d-c295-43c8-914a-691d712ff6f9","segment_index":5,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"episode of Google Cloud Live. 
I'm Vlad"},{"id":"9a884ee1-bf25-4b9c-85b9-469084b83c57","segment_index":6,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Kallesnika, developer relations engineer"},{"id":"c9f83d15-598d-4dcb-8282-5385b7177efc","segment_index":7,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"at Google Cloud and today I'm here with"}],"content_asset_counts":{}},"source":{"id":"06ffb599-3006-40fa-911f-d6276f0fab54","workspace_id":"d9654309-c206-4820-9522-1886720e58c4","name":"How to build a continuous evaluation pipeline for multi-agent systems with Gemini","slug":"yt-WRU7-4bpZkg-d9654309","source_type":"youtube_video","enabled":true,"base_url":"https://www.youtube.com/watch?v=WRU7-4bpZkg","feed_url":null,"domain":"youtube.com","jurisdiction":null,"authority_tier":"tier_d","source_class":"exploratory","access_posture":"fully_fetchable","discovery_confidence":null,"original_discovery_confidence":null,"approval_basis":null,"promotion_path":null,"discovery_context_json":{},"content_type_label":null,"cadence":null,"fetch_method":null,"last_ingested_at":"2026-07-01T06:12:00.175062Z","archived_at":null,"last_run_status":"succeeded","document_count":1,"recent_document_count":1,"policy":{"layer":null,"fetch_strategy":null,"effective_cadence":"weekly","is_event_driven":false,"last_checked_at":"2026-07-01T06:12:00.175062Z","last_changed_at":"2026-07-01T06:12:00.175062Z","next_check_at":"2026-07-08T06:12:00.175062Z","change_state":null,"cache_etag":null,"cache_last_modified":null,"cache_status":null,"stale":false,"due_now":false,"due_reason":"within_cadence"},"evidence_posture":{"origin_lane":"search_council_exploratory","source_class":"exploratory","trust_posture":"exploratory","evidence_class":"source_evidence","access_posture":"fully_fetchable","promotion_status":"candidate","admissibility_status":"context_only","evidence_floor_status":"supporting_floor","evidence_floor_reason":"Exploratory and promoted material remains below the primary evidence floor 
until it earns stronger trust.","summary":"Exploratory and promoted material stays below the primary evidence floor.","reasons":["Explicit access posture hint: fully fetchable.","This source is still exploratory and has not cleared the promotion floor.","Exploratory sources remain context-only until they are promoted through discovery.","Exploratory and promoted material remains below the primary evidence floor until it earns stronger trust."]},"source_reliability":{"score":42.6,"band":"guarded","summary":"Source reliability is guarded at 42.6/100.","reasons":["Authority tier is tier_d, contributing to a guarded reliability posture.","Access/admissibility posture is context_only, so Orbital scores reliability with that trust ceiling in mind.","The source has 1 documents in Orbital, including 1 in the recent window.","This source came through discovery/promotion, so reliability is intentionally capped below a strong curated source unless history accumulates."],"factors":[{"name":"authority_tier","value":7.0,"reason":"Higher authority tiers carry more baseline reliability."},{"name":"source_class","value":4.0,"reason":"Curated and manually approved sources start from a stronger trust base than exploratory promotions."},{"name":"admissibility","value":4.0,"reason":"Access and admissibility posture should raise or limit downstream reliance."},{"name":"coverage_history","value":2.6,"reason":"Sources with more durable document history are more reliable than one-off appearances."},{"name":"operational_health","value":6.0,"reason":"Recent ingestion success is a bounded proxy for source stability."},{"name":"overclaim_risk","value":0.0,"reason":"Low-access or lightly observed sources should be scored more cautiously."}]},"lifecycle":{},"config_json":{"title":"How to build a continuous evaluation pipeline for multi-agent systems with Gemini","video_id":"WRU7-4bpZkg","channel_id":"UCJS9pqu9BzkAMNTmzNMNhvg","fetched_at":"2026-07-01T06:08:11.917416+00:00","description":"Check 
out the codelab discussed in this episode here. → https://goo.gle/prai-rs-2\n\nStop guessing whether AI agents are actually working. Manual vibe checks might be fine for prototyping, but production requires real data. On the next Google Cloud Live stream, the team is showing developers how to move from subjective testing to data driven assessment using Gemini Enterprise Agent Platform Pipelines and Cloud Run functions. Join Vlad Kolesnikov and Leonid Yankulin to learn how to build an automated regression testing pipeline for your distributed multi-agent systems.\n\nWatch along and learn:\n* Data driven assessment: Implement adaptive rubrics and tool use quality metrics to rigorously evaluate AI agents.\n* Shadow deployments: Safely deploy AI agents to a private tagged revision in Cloud Run.\n* CI/CD automation: Integrate continuous evaluation into pipelines to ensure code changes never degrade any agent's proven quality.\n\nDon't let invisible regressions break your production workflows.\n","topic_seeds":["AI Systems Discovering Non-Obvious Model Configurations"],"caption_kind":"asr_auto","channel_name":"Google Cloud Tech","published_at":"2026-06-30T17:27:56Z","transcript_text":"[music]\n[music]\nWe are looking at a 5,000 share order.\n[music]\nHello and welcome back to another\nepisode of Google Cloud Live. I'm Vlad\nKallesnika, developer relations engineer\nat Google Cloud and today I'm here with\nLeonid.\nLeonid.\n>> Hi. Hello Vlad. Uh thank you. Uh I'm\nLeonid. I am a developer advocate at\nGoogle Cloud working with Vlad uh in the\nsame team. Uh I specialize on uh AI\nworkload uh security and observability.\nBut today I will be talking to you about\nuh data and agent evaluation.\n>> Agent evaluation you know we say from\nVIP checks to datadriven evaluations. Uh\nwhat do you think VIP checks are? How\nwould you describe them?\n>> Thank you. Uh and let me switch to the\nslides for a moment. 
uh so when we uh\nspeak about uh vibe evaluation\nuh or before that uh I want to address\nis a question why do we need to evaluate\nAI agents in the first place today\nperception is that AI is smart enough uh\nto do the work and uh so why do we need\nuh to test AI uh the main problem uh is\nthat uh Today people don't believe AI\nlike they did in the beginning when it\njust came. Uh I believe that our\naudience also don't think that uh AI is\nperfect and never makes mistakes.\nSo uh when we speak about vibe check we\nusually speak about uh testings that we\ndo manually. We run uh our application,\nour agent. We ask a few questions uh by\nmanually typing or feeding the\ninformation to the agent. And uh then\ndepending what kind of answers we get.\nIf it is good, we assume that everything\nis works and if they are bad, we return\nback to the uh code or whatever\nconfiguration of the agent uh we work\nwith and we try to adjust some uh\nsettings, configurations, prompts and so\non. So basically, you know, it it's kind\nof similar to testing um normal software\nand you know, if if you just write a a\nregular deterministic software, you can\nmake a change and then you can just like\nrun your application or your service,\ntry a few things like your happy paths,\nsee that they work and you just like\npush it to production. And we know that\nit often breaks with AI, I guess, uh at\ngenerative AI these days. uh it's even\nworse because the AI is like generative\nAI is fundamentally not deterministic.\nSo even when we try something uh and it\nworks at the moment that may not work uh\nnext time. So kind of\n>> it is true.\n>> Yeah. 
So basically the importance of\ntesting uh or checking um it increases\nand why why don't we just like test it\nmanually right like like we do with the\nwith the regular software remember like\nin the past we had all those testers and\nand then uh we end up um having\nautomated tests and then uh people\nstarted talking about unit tests um so\nthat's kind of you know that this is all\napplies u to traditional software What\nabout AI these days?\n>> So, uh I have news for you. We still\nhave this people.\nWe still have QA departments that do uh a\nlot of uh tests manually like uh running\nby themselves. Uh and to be honest, as a\ndeveloper advocate, I often do this kind\nof uh test by running user journeys on\nour products that yet to be released.\nHowever, you are completely right. Uh\nthe main reason that uh manual testing\ndoesn't work well is actually twofold.\nUh first and uh it is very trivial uh to\ntest all uh possible scenarios that we\nknow or expect to work like golden paths\nor critical user journeys uh isn't\neffective when we speak about uh uh\napplication or service at scale because\nuh there are too many and each time when\nwe introduce a change or like submit uh\na PR request or something like this\n[clears throat] to run it uh by people\nit is just very very slow. Another\nreason for this is that uh when people\nrun it uh again I don't know your\nexperience with QA teams. Uh in my uh\nprevious uh workplaces QA teams were uh\nfiling some uh test reports besides the\nbug that uh they found describing what\nthey actually test. They had test plan\ndocuments that they checked. I did this\ntest. I did that test. Uh but in many\ncases still a lot of uh stuff remains\nbehind the scene and so all this test\nsessions uh kind of ephemeral. Uh you\ncan't like track what actually was done.\nUh you can't uh compare against previous\nexecutions and success. 
And this is why\nwe actually moved uh from uh uh like uh\nfocus on manual test by the teams to\nmore automated approach.\n>> I see. I see. So um now if we look at\nthis um testing software that that we do\nor evaluation right uh with AI and\nsoftware is similar but with with\ngenerative AI it it's kind of a common\nthing and it comes from traditional\nmachine learning uh we do evaluate stuff\num during development right during the\nexperimentation stage and uh we we\nchange our prompts and then we see\nwhether\nyou know agents behave better or worse.\nUm so now um what's the difference\nbetween this experimentation and really\nthe the evaluation in uh in the form of\nyou know where traditional unit tests\nare used.\nSo uh as uh we said in general the\nmanual testing especially for AI it is\nuh much more I would say uh relevant\nuh manual testing uh for AI phase when\nwe try to tweak uh our configuration for\nAI we don't actually code AI right we\ndon't uh modify code for the model in\ngeneral\n>> we still can we still can right\n>> we can but uh when I speak about\nsoftware developers not data engineers\nnot AI science scientists\nuh we uh adjust some configuration like\nuh there is a parameter called\ntemperature which describes how much we\nallow to uh the model to uh experiment\nor go to uh some discovery interface uh\nand there are other things like uh\nsystem instructions, prompts, uh\nadditional uh information that we can uh\nplug to uh the system or some logic uh\nprovided with the agent configuration.\nuh so this is kind of uh the phase that\nwe uh can call discovery when we uh\nexperiment and uh it is hard to uh\nalgorithmize\nuh experiments. So it is naturally uh\nhave a creative and open-ending uh natur\nuh approach which uh makes sense to do\nmanually uh to have a a person uh doing\nit versus the agent. 
uh on another side\nwe have uh our CI/CD process that have\nto be rigid and repetitive and this\nprocess\n>> yeah ju just for for the audience like\nwhen we say CI/CD we mean like\ncontinuous integration\n>> and continuous deployment as far as I\nremember right\n>> and this is where\n>> yeah this is where effectively our uh\nsoftware in this case our agent is sort\nof like done or almost done right in and\nit's in the process of final tweaks or\nit's in the process of being uh\ncontinuously improved because the you\nknow this is what we do we continuously\nimprove our software we test new models\nwe do little tweaks uh with our prompts\nuh or systems instruction etc etc right\nso I\n>> I would say yes and no actually because\nuh when we say agent uh today uh this uh\nhas certain ambiguity because the engine\nincludes some interaction with the\nmodel. I can simply change the model I\nuse uh and start using a different model\nwhich can completely change uh the\nbehavior of the overall application. But\nI can also go and uh add uh some uh\ntools uh some instrument in instrument\nthat model can invoke through uh some uh\nthinking process. Uh the model can uh be\nmade aware by the agent that it can uh\nuse some tools and these tools\nessentially implemented as uh simple\nfunctions as uh in the body of the same\nagent. part of the same uh application\nor it can be implemented as some remote\nservice that uh the agent can call based\non the request from the model and I can\nchange this uh functions uh simply and\nuh by this I again can influence how\nthings work. So uh sometimes when we\ncommit the change we make the change we\nuh we don't know uh how it will uh\naffect overall uh work of the agent\nespecially as we mentioned uh due to uh\nunpredictability\nuh and uh lack of determinism of the\nmodel. Additionally, uh I want to jump a\nbit aside and speak about uh DevOps work\nthat includes uh testing uh of the whole\nsystem, some integration testing, system\ntesting. 
Uh the terminology can vary but\nuh it means that uh maybe we make a\nchange to a completely different\ncomponent of the overall application,\nnot the agent itself. that still can\naffect one way or another how the agent\nbehave and we still want to test it. As\na result uh uh we want on any uh change\nthat gets submitted to our source\nmanagement system\nintegrate the testing of the agent. And\nthis uh testing can be done manually or\nbetter say it can be done manually but\nit then uh inherits the problems that uh\nI previously mentioned very slow uh\nephemeral without tracking without\ncomparison and uh again uh the testing\nis uh kind of uh hard to do because it\nrequires uh quite a lot repetitive uh\noperation and we need it to be uh\nrigid.\n>> So essentially what I'm kind of seeing\nand when I when I look at this uh\ndefense mode of continuous evaluation\nis it's very similar to a combination of\nunit tests and integration tests. we\neffectively have a set of happy paths\nlike right the deep and I will\noversimplify the prompt and the\nresponses whether it's individual agent\nuh in a in a multi-agent system or it's\nthe whole multi-agent system that gives\nyou a response after going through a\nnumber of uh you know iterations\ninternally uh we want those happy paths\nto work despite\nus making some changes right because we\n>> yeah it A very good point. Uh I uh saw\nalong the starting 2024 when uh gen AI\nemerged uh as a new way to to implement\napplications\nuh that uh there were many discussions\nabout how we do it with AI, how do we do\nit a new way. But uh in my personal\nopinion uh it is the same old way. We\nhave the same unit tests. Uh we have the\nsame methodology. We just uh adjust a\ncertain implementations\nto the fact that uh AI-driven workloads\nthey uh behave differently. 
uh to your\npoint uh the main challenge uh to\nimplement uh the standard uh success\npath is uh the same problem that uh\nit is not always easy to do with AI\nbecause AI is probabilistic. I believe\nwe will repeat this uh mantra along the\nwhole uh broadcast as a reason for all\nthings that we do or we show. So\ncorrectness verifications in uh with AI\nuh requires uh what we call semantic\njudgment because AI uh doesn't\nessentially returns uh one for success\nor zero for fail. it can return a\nvariety of information even when we\ninstruct it uh to return information in\nsome structured form instead of uh\nhaving it return in the free text.\n>> Um it's it's good that you mentioned\nthat because now we can start talking\nabout like ways that we test AI, right?\nSo what are like obviously we have\ntraditional software where we can just\nlike run some code and uh measure you\nknow the results and it's usually you\nknow remembering that how we write unit\ntests or some method was called or uh\nthe return value contains this string or\nthis value equals you know one to three\nwhat's the difference with with AI and\nand what's the approach to do it in\ngeneral and then we can talk about doing\nit at scale.\n>> Okay. Uh so I actually uh thinking uh\nwhere will be the best way to start. We\nhave uh\ntwo pairs of categories. Uh probably the\nbest way to describe it. I want actually\nto start with uh\nthe yes with uh codebased versus model\nbased uh categories with graders\n>> uh because it uh kind of continues the\nprevious discussion about unit testing\nwith the old style and new style. So uh\nthe interesting point is that despite\nwhat I mentioned previously that uh AI\nuh doesn't return uh very deterministic\nuh responses. Uh it is still possible in\nsome cases implement uh the standard uh\nverification mechanism that follows a\nprecise and deterministic path. uh\nbecause uh some qualities uh of the AI\nresponse still can be enforced. 
Uh for\nexample, uh we uh we can uh enforce uh\nuh AI model to return all responses in\nJSON according to some data model.\n>> Oh yeah, that that's structured output,\nright? So we we can we want to make sure\nthat that's still the case, right? true\nand then uh we can define a codebased uh\ngrader or validation process that uh\ncomes and check if the JSON uh structure\nmatch the model that we actually asked\nbecause again uh model behavior is not\ndeterministic. So model for some reason\ncan decide that for this particular\nrequest it makes sense to create a diff\nadditional field to the JSON because it\nis will be very usable according to and\n>> or the developer decides to change the\nmodel because some model works faster\nbut it's not that great with with\n>> you know with structured output and then\nwe end up\n>> yes and someone could decide yes that\nthey will change it without uh properly\ncommunicated along the all uh\nimplementation logic. Uh so we can still\nuse uh code-based deterministic\nvalidations. However, for the rest of\nthe stuff like uh if we actually need to\nanalyze uh the content returned uh by\nthe model uh we need to use uh so-called\nmodelbased graders uh or they are also\noften referenced as LLM as a judge uh\nwhich use uh a strong AI model to\nevaluate uh the information according to\nsome uh criteria that we define as part\nof the code. 
Now uh the important point\nhere is that this strong model by uh\nit's not a requirement I believe but it\nis very uh strong recommendation it\nshould be\nstronger okay I strong too many times\nbut anyway uh it should be stronger than\nthe model used by the agent so uh using\nthe example with Gemini we have Gemini\n3 Flash model very good model with uh\nmultiple levels of reasoning uh\nsupporting uh multimodality inputs uh\nand we have a Gemini uh 3 Pro model\nright now they are 3.5 I believe uh\nspecifically but uh the Pro model is the\nmodel with much larger context and a\nmuch uh uh I don't know how better\ndefine it more depth probably I'm not AI\nengineer. So if I talk gibberish at the\nmoment, I beg your pardon. Uh I only\nknow that the Pro model is much stronger\nthan the Flash. This is my level of\nexpertise.\n>> It it it is true. It I mean in general\nit it's fine. Uh sometimes especially\nlike when you when you do this at scale\nto use u not necessarily the the the pro\nversions. What's important is is that\nthe the model I would say this the the\nthe checks are not part of the the the\nexecution flow themselves because right\nif you just take u your usual LLM ask it\nto do something and then at the end or\nyou you can even add make no mistakes in\nthe prompt right um you still need\nanother sort of like um independent\nentity to look at it and tell whether\nsomething makes sense or not. In the\nsetup that [clears throat] excuse me in\nthe setup that uh we have in terms of\nthe the system instructions and the the\nthe input that model received and then\nthe output.\n>> Yeah. Um thank you.\n>> So now um yeah you you you want to talk\nabout reference-based and reference-free. So\nuh I mentioned uh criteria in the\nprevious uh uh when we spoke about uh\nsorry when we spoke about graders I\nmentioned that uh for model uh it uses\ncriteria. 
So when we define criteria, we\ncan define them uh\nin such a way that uh the test uh\nreferences uh something like a ground\ntruth or a golden data set when it\nevaluates the criteria or it can use a\nreference free uh criteria which is some\nkind of uh universal.\nuh\nas an example consider uh that we do\nexpect for example ah and this uh\ncriteria this uh metrics uh can be used\nuh with any types of the graders\nobviously. So uh you can consider uh\nusing for example for the uh codebased\nuh grader uh we can deter uh use a\ndeterministic uh validation that the\nJSON that uh the model returns is well\nuh formed. So the syntax of the format\nis uh preserved and uh the parser will\nbe able uh to process it and vice versa\nfor the uh model based uh we can provide\nreference that uh includes some uh\nspecific information that should be\nmentioned examples uh it can be\ninformation about uh\nI I don't know uh about chemical\nelements or whatever.\n>> Yeah, just to give a few examples. So\nthe reference based metrics is\nessentially when we sort of like have\nthe right right response, right? If I\nask uh the agent about and this is the\nexample that we're going to use like a\nhistory of Rome, right? uh I may have a\ntext with the with the expected response\nand then the um this grader will look at\nthe at at the question and at the\nexpected answer and then then they\ncompare it right uh uh\n>> and I guess the reason we use metrics\nand not just like a straight comparison\nuh is because u usually like due to this\nnondeterministic nature like you said\nwe're going to mention it a lot of times\num they would normally never be the\nsame. 
So we have we need to compare them\nsomehow and then establish um this sort\nof similarity metrics like oh how\nsimilar the answer is right that's\nthat's sort of like one thing another\nthing like with the with the tools again\nlike we we have we have u this prompt\nand then we have u some expectation that\nif we if we ask a question the tools\nwill be called and uh there's going to\nbe parameters And there's going to be\nresults of the tools. And again, they're\nnot always the same. They still do the\njob. They still do the task. And from\nfrom anyone's point of view, um, it it's\nright. It's just it's not going to be\n100% right. That's why like in\ntraditional machine learning, we\nintroduce this evaluation metrics where\nand then we decide for each metric uh\nand like what what's going to be the\nvalue that we consider. Okay. Is it 90%,\nis it 75% that is right? And once um you\nknow it's gonna it's it's not right. Uh\nwe like we can we can say no this this\nchange actually broke the agent or if we\npass then we say okay fine we go. Um\nwith the reference free three free\nmetrics we essentially ask the agent to\num to verify the answer without without\nhaving the the right answer right and\nand that um example that you gave with\nuh whether it's JSON or not like valid\nJSON right we don't need we don't need\nto have a reference we just need to see\nif if it's valid JSON or if it's if it\nmatches a certain schema right we don't\nneed to right answer. We just need to\ncheck whether uh it's it's in the\ncorrect form. And uh we have different\nways to um to to use this reference free\nmetrics with something that is not\nnecessarily as rigid as JSON, right? 
We\ncan uh we can see if the text for\nexample is fully grounded in in the data\nthat you know agent acquired through you\nknow sub agents or through through some\ntools right we can evaluate the\ngrounding whether it's with some\ntraditional and like um NLP metrics such\nas BLEU right when we just like compare\nwords and and you know and allow it to\nbe different but not too much or we can\nuse an LLM grader uh and ask it to see\nif there are any facts in the text in\nthe result that are not grounded in the\nsource data. Right? Because the models\nthey they may be very creative. They may\nuse and they they they do use all the\ntime um data that they were trained on\nwhich may not necessarily be true,\nright? And especially if they see a\nfamiliar topic uh they may enhance the\nanswer with their own knowledge that um\nwas not necessarily given them u during\nthis you know inference process inside\nthe engine and again like I said the\nknowledge may not be true and that you\nknow that applies to a lot of um uh\nthings starting from like I don't know\num soccer or football player stats uh to\nsome laws and regulations or you know\ntax code etc. It changes all the time\nand uh every model is sort of frozen in\ntime with the data that it was trained\non. So we need to enhance it uh with\nadditional tools and data sources and\nuse all these techniques you know agentic\nRAG, retrieval-augmented generation etc\netc. 
So and we need to be able to check\nwhether\nuh what we did like those changes\nwhether they broke this flow and the\nmodel started producing you know model\nstarted being too creative right\n>> yes uh Vlad I am afraid we start losing\nthe audience with all the theory uh what\ndo you say let's move to [laughter]\n>> implementation part so uh would you\nshare with us uh how we're going to uh\num to demonst demonstrate all this uh\nprocess of testing.\n>> F first of all uh we actually have uh a\ncode lab that people can try themselves.\nSo everything that we show today um you\ncan try yourself and if we switch from\nthe slides to uh the code lab um that\nI'm currently sharing\nuh it it it will be shared as a link uh\nas a resource uh for this broadcast but\neffectively everything that we're going\nto talk about is uh is in this code lab\num and uh there is a repo for this code\nlab that um you can clone clone uh you\ncan clone the the kind of the full the\nfinal lab or you can clone the starter\nbranch and you know go through the\nthrough the lab yourself. Um and yeah we\ntalk about the the evaluation we talk\nabout deploying those agents and testing\nthem as as part of the continuous um\nintegration and deployment process. Uh\nbut I think uh we can just you know\nswitch to to our agent and and show it\nin action and uh as well as you know\nbasically tell people what what that\nagent does and why\num and why it does it this way. Right.\n>> Yeah. Uh just one uh note that I wanted\nto add. We uh we are not going to run\nthis lab uh right now. We did a quick\nshortcut to save time. Uh and we\nenrolled and prepared all the setup\nsteps up to probably uh up to uh step\nseven.\n>> So we will uh go and review uh the code\nhow it implements and how it uh match\nversus the short theory introduction\nthat we did. 
Uh you can uh later watch\nus again if you enjoyed that uh seeing\nit or you can uh read through the someh\n>> okay I'm going to share my Antigravity\n>> uh I believe l that Antigravity is your\nuh platform of choice but it is not\nnecessary to implement this right\n>> it is not just uh but this is where I\ncode Antigravity IDE\n>> so let's talk about the agent first um\nand the the structure of the project in\ngeneral so this is a multi-agent system\nuh that has multiple agents um and that\nalso has the orchestrator agent that\nkind of you know babysits those little\nagents and if we look at the code let me\nactually look at uh the orchestrator\ncode that it's better to show than than\ntell about it, right? So the\norchestrator\num it's effectively a you know one of\nthose um deterministic\num workflows uh that uh Agent\nDevelopment Kit supports. Oh that is\nimportant to mention. We obviously\ndeveloped our agent using Agent\nDevelopment Kit. Uh so it it's a\nsequential agent that has two sub-agents\nand by the name of the uh this class you\ncan tell they run one after the other.\nWe have a research loop that kind of\nlooks for data um evaluates whether it's\na good data and may like repeat the the\nprocess of discovery and we have a\ncontent builder that builds the content.\nWhat kind of content? whole our whole\nsystem is actually an education course\nbuilder. So the user specifies the topic\nthat they want to learn. I don't know\nlinear algebra and um our agentic system\nlooks for information about this topic\non Wikipedia, right? Using Wikipedia\nsearch and the information that it has\nand then\n>> a tool a tool.\n>> Yeah, it's a tool. It's a tool, right?\nAnd that's what researcher does in our\nloop agent, right? 
And then we have\njudge that judges whether what\nresearcher found is sufficient for um\nyou know for building an educa like a\nplan of the education course and then we\nhave an additional leadal agent that\njust checks for escalation like when the\nagent says it's good like we can break\nthe loop and finally uh go to building\nthe content. So\n>> effectively\nuh Vlad uh effectively it looks like we\nuse this LLM as a judge pattern for our\nown benefit in this application. Uh it\nis not related to the testing but uh\nit's just used uh for us to to validate\nthe results very similar to how we are\ngoing to evaluate. Right.\n>> Yeah. Yeah.\nAnd uh\nI'm I want to just demonstrate the agent\nin action. Um and for that I have to\nstop sharing again and share\nthe creator, the course creator. So\nlet's say I want to I want to learn\nabout history of Rome, right? That's a\nlittle web app that is built on top of\num the the agent. So essentially as you\ncan see um internally it start from\nresearching the topic then it will go to\nfactchecking and then it will go to\nwriting the the actual course. The\nfactchecking is really the judging. It\ndoesn't necessarily checks the fact it\ncheck it checks whether the you know the\nthis information that we're going to\nbuild the course on is um comprehensive\nenough. So it checks for a few things.\nSo while it's doing it it it its work uh\nlet's look at um you know those\nindividual agents and we start from the\nresearcher because\n>> uh maybe we want to uh to check uh what\nthey do uh what we want to test actually\ninstead of uh exploring the the\napplication code and look on the code\nfrom the perspective what we actually\nvalidate. What do you think? Yeah. And\num and that's that's why we start with\nthe researcher because that one of the\nagents that we going to uh we're going\nto test, right? 
So the researcher agent,\nit's a it's a very relatively simple uh\nLLM agent with certain instructions with\nthe Wikipedia search tool that we give\nit, right? And um effectively when we\nwhen we enter this like you know history\nof Rome prompt uh the agent goes ask\nWikipedia for this topic and then in\nfact almost like returns almost whatever\nthe Wikipedia search um uh return to it.\nIt's a it's a very simple thing although\nwe still need to make sure that um for\nexample it doesn't hallucinate uh with\nadditional information that is not\nnecessarily correct. Um if we look at\nthe judge right which we're not going to\nevaluate separately\nuh we we can see that it evaluates\nwhether it's like it it's high quality\ninformation and uh um if if facts are\nsupported by data etc etc. Um and then\nfinally uh for the for the content\nbuilder it gets the information that uh\nwe received and uh builds the the course\nin a certain format. That's that's what\nwe do. And like we we when we test agent\nas a whole, we basically test the result\nthat the content builder provides us,\nright? And and we in the in the\napplication, this web app um that we see\nuh we we will we will see the result of\nthe content builder agent.\nUh they're still working by the way. Um\nso now how are we going to test this?
um\nfor for this lab we build um a set of\nhelpers I would say right and uh if we\nthink about different approaches and\ntools that we can use for testing agents\nwe obviously need to talk about\nframeworks and systems and fortunately\nuh in Google Cloud we have a Gemini\nEnterprise Agent Platform uh which we\nuse Gemini through uh but it also has um\nGenAI evaluation service and GenAI\nevaluation service consists of the you\nknow the SDK and the service in cloud\nright the SDK helps you um you know set\neverything up kind of wire your agent uh\nand even test um test things locally if\nyou want and with the GenAI evaluation\nservice in the agent platform um you can\nyou can run those tests at scale today\nwe're going to run things locally just\nto you know to make it faster Because uh\nin our case whether we test researcher\nand by the way this is our test data and\nthat I prepared uh or we test\nthe orchestrator as a whole uh we have\nvery little data so we we can sort of u\ntest it locally and um when we test\nresearcher uh you can see that we're\ngoing to use a few reference based\nmetrics because we have a references\nuh for the answers as well for the tool\ncalls. for example. Uh, and for the\norchestrator, you might have noticed\nalready that I only have a prompt and\nthat's it, right? I know the the right\nanswer. It's just it's just it. So,\n>> yes.\n>> Uh, can I interrupt you for a second?\nUh, I probably missed it. Uh, where do\nwe use these uh JSON files? How do we\nfeed them into the system? We feed them\nuh through the through the GenAI um\nevaluation SDK that um agent platform uh\nfolks build for us and uh in the code um\nyou can see in in there's a shared\nfolder there's this evaluate.py that\nessentially reads a JSON file and uh you\nknow calls um all the\nyou know all the SDK methods to such as\nlike creating evaluation run uh to\nactually submit the data to the\nevaluation service or the evaluation\nSDK.
So effectively what we yeah\n>> sorry for interrupting you again. So\njust uh uh help me to understand uh\nwhere the evaluation actually runs like\neach each step of it what do we actually\nrun on the agent platform?\n>> Oh that that's a great question. So\nfirst since our agent especially now it\nruns locally on my machine\num um or it can run in in our case and\nwhich we provide the code for in Cloud\nRun. Uh first step that we do we run the\nagent inference right so we effectively\ncall the agent ourselves in this run\nparallel inference function.\n>> So our test uh does this uh execution\nright? Yeah, it's either runs on my\nmachine or it runs in I don't know in\nCloud Build whatever CI/CD system you\nhave. You actually call the agent right\nfor real. Uh and it's not the GenAI\nevaluation service that runs your agent\nor calls your agent. You because you\nneed answers from the agent that the one\nthat um ones that agent actually\nprovides when you you give it the\nprompts. And uh the beauty of this is\nthat you run your agents in in kind of\nreal environment. It's either real\nenvironment meaning it's it's not like\nwith unit test when you call functions\nindividually and you create a bunch of\nmocks etc. No like you call LLMs and\nLLMs produce answers right and it's all\num it's all um\nuh for real. it it's it's not some mock\nthings that we that we invent. It's all\nfor real. We collect those uh that data\nset and mix it with our test data.\nRemember the test data is these little\nJSONs, right? So it's it's prompt, it's\nreference answer and then reference\ntrajectory which is tool trajectory that\nwe expect for this prompt, right? And we\nwe do this for real. Once we have this\ndata set with prompt reference answer if\nwe have it and then the real answer, we\ncan submit this to uh the evaluation\nservice or evaluation SDK if we do it\nlocally and just ask uh to you know make\nthose judgments, make those runs uh with\ncustom metrics etc.
Um and uh then we\neffectively can produce the evaluation\nmetrics and then decide whether the\nvalues of those metrics are high enough\nto you know to sort of pass the test.\nRight. Hopefully I answered your\nquestion.\n>> Yeah. Thank you.\n>> Yeah. Meanwhile, our agent finished the\nthe the course. I'll briefly show you\nhow it looks.\nHere we go. So yeah, I asked for history\nof Rome. It got a bunch of info from\nWikipedia and then it built a plan for\nyou know for in 10 modules to u to\ncreate a course of uh history of Rome.\nAnd this is not the whole course\nobviously right it's just a very\ndetailed plan that you can use to to\ncreate you can create another\nagent that actually builds every single\nmodule and this way you create the whole\ncourse.\nAll right. Um, switching back to our IDE.\nA lot of back and forths, but that's\nthat's what we do as developers, right?\n>> Uh, hopefully we do it less with AI.\n[laughter]\n>> So, yeah. Uh, we have uh we have our\nagent um uh running already because we\nwe tried it with the website. Uh let's\nactually run the evaluation.
And if we\nlook at um um our individual evaluation\ncode right evaluate agent uh you can see\nthat we evaluate two agents like there\nis a piece that when we evaluate the\nresearcher which is the very first agent\nthat is being invoked I would say very\nfirst LLM agent that is being invoked by\nthe orchestrator because orchestrator is\nessentially like a deterministic\nworkflow inside um ADK right and we try\ndifferent metrics with this some of of\nthem are reference based for example\nresponse match it evaluates whether the\nresponse of the agent is close enough to\nwhat we use as a reference answer some\nof them is reference free such as\nresponse quality it just assesses the\nthe quality of the response like in\nEnglish language whether it's good or\nnot same with tool use quality right it\nassesses where the tool are used\nappropriately and there are some uh\ncustom um metrics that I created to uh\nto calculate trajectory precision\ntrajectory recall. This is for tools,\nright? Whether uh our reference tool\ntrajectories um sort of like reflect the\nreal um the real trajectories that we\nhave that's for the researcher. Vlad uh\ncan you uh elaborate to people like\nmyself uh who think they understand what\ntrajectory means in this context but not\nentirely sure? 
Trajectory tool\ntrajectory is essentially a sequence of\ntool calls like whether you you call\ncertain tools in in certain order right\nuh the arguments of those tools and the\nresults of those tools especially if the\ntools are like entirely deterministic\nright that's what that's what uh and\ntrajectory is like when we say about\ntool trajectory it's essentially the\nsequence and then the way you can\nmeasure it is different like if you\ndon't care about the the order then you\nmay say oh I want to evaluate uh the\ntrajectory\num I I want to have these tool calls in\nany order right or I want to I want to\nhave I want to make sure that this tool\ncall was in there I don't care whether\nthere were other tool calls or not so\nyou can combine this\n>> understand\nand these are custom metrics uh the\nplatform doesn't provide this specific\nUh uh\n>> yeah I literally\n>> rubric sorry not metrics, rubrics.\n>> Yes. Okay.\n>> And then um we also have the\nuh the orchestrator where um I in this\ncase we just we just check for\nhallucination and the hallucination\ngrader right the grader that runs u as\npart of the GenAI evaluation platform. uh\nit checks whether the response is\ngrounded or there are pieces that are\ncompletely made up and they're not\nactually in in this initial data that\nwas returned um in our case by\nWikipedia.\nSo let's run the evaluation.\nLike I said, we're going to run\neverything locally.
Um it's going to\ncall our agent that runs on localhost.\nSurprise surprise runs on my machine.\nUm and uh it's it's going to take some\ntime and while it's it's running we\nactually can talk about different kind\nof rubrics right we say like rubric\nmetric what is even rubric metric\n>> um let's call yeah and the the rubrics\nit's essentially\nlike a prompt for judges I would say\nthat's that's what I like to call it\nmaybe it's not the most precise\ndefinition but I like to that's how kind\nof I like to call it it's prompts for\njudges and the funny thing that there\nare there are static rubrics and there\nare dynamic rubrics what's the\ndifference between them so the static\nrubrics is is when we um effectively I\nknow hallucination is a great example\nwhen we say when we tell to the judge\nlike oh here's here's the prompt here's\nthe result see if and here's the the you\nknow the intermediate steps such as\ncalling different tools etc. And we want\nto see if all these data is grounded\nlike please LLM go and check if every\nsentence or every paragraph is actually\ngrounded in this initial data so the\nmodel doesn't hallucinate. The dynamic\nrubrics is is is\ndifferent. We effectively\nlook at the question, look at the\ndescription of the agent and then we\ntell another LLM generate the prompt\nfor the judge to evaluate this result.\nSo basically it it it happens that the\nprompt for the judge essentially the\nthings that this um this um judge is\ngoing to check for depends on the\nquestion and I'll give you an example.\nLet's say I'm I'm running like a an\nonline store and I sell headphones right\nand I I\nask the agent how much are these kind of\nheadphones you know earbuds how much are\nearbuds?
So in this case when when the\nsystem the the evaluation service sees\nthe question and when I say that I want\nto use a dynamic rubric it's going to\nform the judge question in the form\ncheck if the answer contains price for\nthe headphones\nfor the earbuds right that's that's how\nthis dynamic prompt is going to be\nformed and for every like let's look at\nthe researcher for every question that I\nhave\nthis prompt for the judge is going to be\ndifferent because the judge will be\nchecking whether the you know the\nquestion is actually has answer for uh\nsorry the answer is going to have the\ndata uh that fulfills the question\nthat's the dynamic rubric\n>> so let me understand we previously\nmentioned the problem of the uh\nprobabilism and uh like lack of\ndeterminism\nuh when models involved if this uh uh\nrubric generates\nuh the specific uh prompt or specific\nresponse each time. Does it mean that on\neach execution it will be different?\n>> Pretty much. And then how do we\nguarantee that different runs of the\nsame uh uh of the test with the same\nrubric will provide us with like similar\nuh level of uh validation if you wish\nhow does it help us if uh it is not\n>> yeah that's a great question we we\nobviously cannot guarantee nothing in AI\nworld is guaranteed uh okay but uh the\ngood thing is that we have we can do\nthings at scale. 
So instead of\nguaranteeing something we just do\nsomething a lot of times right and\n>> uh if you read the documentation for\nexample about this different metrics\nthat we use different rubrics you will\nsee that underneath\n>> that question that we ask judge to\nanswer is actually asked many times\nright and then and then we sort of like\n>> um u\n>> there's this\n>> so each prompt\n>> each prompt um doesn't mean like\ncompared to the uh conventional software\ntesting when we have each test and we\nhave a collection of inputs for example\nuh that we can match to the collection\nof prompts in our JSON uh but in uh\nconventional uh software testing uh each\nuh test will be executed for each prompt\nonly once but in our case we will\nactually have each test executed\nmultiple times. Is it what you are\nsaying?\n>> Yes. And then we're going to take like\ndepending on the on the metric we're\ntalking about like we're going to take\nthe average of whatever the the right\nthing in that situation is.\n>> Yeah. So we we we went through the first\num iteration with the researcher\nevaluation and now we can actually and\nwe can actually look at what\num what that evaluation produced.\nLet me show you.\nOkay, I have to\nYep, that's the one the local. So, this\nis um this is the report of evaluation,\nright? And uh these are our metrics\nfinal response match the quality tool\nuse quality etc. Trajectory recall and you\ncan you could see if you notice on the\nscreen that everything was great\nobviously because that that's a happy\npath. That's a demo and surprisingly it\nworked this time. Uh but you can see how\nthe agent actually uh sort of like\nexplains\num like why\nthe LLM thinks that it's all good,\nright? So the final response match it's\nessentially says that it matches the\nexpected response like\nI would say enough.
So it's obviously\nnever the same, but it's it matches it\nenough to say that yeah, it it matches.\nAnd then response quality is good\nbecause it provides sort of like uh the\ninformation that the user asked for in\nthe form that it asked for. Uh and then\nthe tool use quality we call the tools\nthat we expected and uh the trajectory\nprecision. There's no explanation\nbecause it's it's calculated by you know\nPython code. And then deterministic,\nright?\n>> Yes. Yes.\n>> Uh meanwhile, my hallucination check\nalso passed and I I really want to show\nyou that because it's not 100%.\n>> It is not a usual I believe with\ndeterministic or with hallucination,\nright?\n>> Hallucination, it's a very heavy metric\nbecause the answers are big and it's a\nlot to check. So, let's look at that\none. And you can see that it's it's not\n1.0. What's the what's the result? Let\nme look.\nYeah, it's like 0.82,\nright? So something like the model\nactually hallucinated a few things but\nkind of accept at in the\nin the acceptable manner. So we see that\nfor example there is a failure.\nHopefully you see like let me increase\nit even.\n>> Yes I I can see but I think we can uh\nzoom in. uh can you expand and see what\nis the rationale behind it?\n>> Yeah. So you can see that uh the the\nagent answered that early defense is then\nlike Rome subdued uh neighboring\ntribes.\nuh and then the context basically that\nwhat Wikipedia provided\num is that it was it actually happened\nin a different at a different year right\nand uh the model was able to spot it and\nsay okay that like that's effectively a\nlie like you you messed up the dates um\nthis one like the the Punic War Rome\nbuilt its first major navy and seized\nSardinia and Corsica, right?
And then\nthe label unsupported means that we\nactually do not have data for some of\nthe for some of the statement, right?\nUm, again, this may be true because the\nmodel the model knows about history of\nRome obviously, right? It was trained on\na ton of data way more than Wikipedia\nhas. But what's important here is that\nit must be grounded to the data that we\nprovided initially. So if we talk about\nlike I mentioned if we talk about legal\nstuff right the laws they change all the\ntime even though the model was trained\non some version of a law, the law may\nhave changed uh and we should make sure\nthat the data that we returned you know\nin our law library was used and not the\nold data that the model was trained on.\n>> It is a very good point. Uh oh. Do you\nthink that we can uh uh zoom in into\nthis uh custom uh call uh function\nexecution and uh look in it closely\nbecause uh some of the standard criteria\nuh can be used or not used depending on\nthe flow. But uh the power of uh\nintroducing our own tests especially\nwhen we speak about uh testing some uh\nintegration or some uh more\ndeterministic scenarios that we want to\nenforce uh in our agent uh can be very\nuseful I believe to many people who\nwatch us.\n>> Yeah. So that's kind of how we evaluate\nagents. Um I'd like to briefly talk\nabout uh using this flow as part of a\nCI/CD pipeline uh with Cloud Run which\nis also part of the live uh our lab our\ncodelab. Um\n>> okay\n>> so essentially like if we look what am I\nprojecting now I'm projecting\n>> uh codelab\n>> yeah I'm going to project uh my\nAntigravity window again\nso as part of our checks if we if you\nlook at the kind of the deploy script\nthat it is it deploys everything to\nCloud Run deploys all the agent in our\nmulti-agent system they integrate using\nA2A protocol u interact using um A2A\nprotocol.
So the the the the root agent\nuh talks to other agent using A2A and\neach agent is deployed individually.\nObviously for this toy example that may\nfeel like overkill for you but for the\nreal agent that's how you want to do it.\nAnd um our deployment script they\nactually use um the approach called kind\nof shadow deployment which um um Cloud\nRun supports where you can deploy your\nservice with the so-called revision tag\nuh and in this case I used the my git\ncommit as a revision tag right and that\nrevision will be a full-blown Cloud Run\nservice in Cloud Run and you can call it\nit has uh like HTTPS endpoint etc. but\nit's not serving traffic to the regular\nusers. So the users don't actually see\nit. It's only your system the CI/CD\nsystem that knows that this revision\nexists right it has its own URL. Um let\nme show you. Uh I want to just to\nclarify uh uh Vlad because uh I hear\nyou uh using uh a a new uh revision and\nnew service more or less in the same\nsentence. So when you use it uh what\nactually get created\na new service\n>> the a new service is not created let let\nme show you quickly\nwhich actually makes a lot of sense.\nI'll show you how it may look.\nOkay. While Vlad is uh going through the\ncode, uh I can uh I want to share with\nour uh watchers. If you never used Cloud\nRun, it is a serverless platform\nuh that allows you uh quickly and\neffectively deploy\na a workload as a service. Uh\nit runs uh fully managed in the fully\nmanaged environment uh on Google Cloud.\nSo you don't have to care about scaling\nproblem about allocating resources and\nother stuff. And uh I believe what Vlad\nis trying to demonstrate is how the\nservice will look like when we deploy uh\na revision uh in the shadow mode.\n>> So multiple revisions uh practically\nspeaking uh allow us uh to manage\ndifferent versions of the same service\nand usually we have one main like\nversion uh or revision in the production\nthat gets all traffic.
So here's here's\nthe here's the all my revisions. Every\ntime you deploy like redeploy a service\nin Cloud Run it creates a new revision\nright and uh revision may have a tag may\nnot have a tag all my revisions are\ntagged and um by default when you deploy\na new revision if it's if if you don't\nspecify additional parameters with the\ndeployment it's going to start serving\ntraffic right so it it has 100% traffic\nserved and you can see that my February\nrevision serves the traffic. Uh when I\ndeploy um a tagged revision without\nswitching the traffic, it's getting its\nown URL. You see like that little popup\nwith the with the URL that is named\nafter revision tag and then it has like\nthis little\n>> like three\n>> three I don't know how to say it like\nthe three dashes. three dashes and then\nand then the the you know the regular\naddress. So this is where my private\nrevision is accessible through right\nthat's that's that's its individual URL.\nSo it's not when I go to this default\nURL of my Cloud Run service\num it's going to serve the revision that\ncurrently serve the traffic like the\nCloud Run can actually split traffic\nbetween revisions so you can do A/B\ntesting etc. But for for us it's\nimportant to deploy this shadow revision\ntest it fully and if tests are passed\nthen we can switch the traffic right we\ncan click manage traffic and say uh what\nwhat is the revision that I want to\nserve traffic through so we deploy it\nit's for real it's real service we can\ndo all the integration testing with all\nthe all the agents involved once it's\ngood we say okay this agent is ready\nbecause it passed the\ntest let's switch the traffic and\nredirect our users to this new improved\npresumably agent.\n>> So we we test on a real production\nenvironment without the risk exposing uh\nusers to the untested version, right?\n>> Yep.
Yep.\n>> Uh I just want to add uh from the\nsecurity point uh there is no actual\nadditional security introduced for this\ncase. So if someone knows this URL and\nas you saw the URL is kind of generated\nusing some uh\nuh\nsignature or\n>> it's actually the the commit hash from\ngit.\n>> Yes. So uh if someone knows it uh this\nservice can be accessed. So a certain\nlevel of uh vulnerability is present but\nit is relatively low because again we\nexpect that uh without getting access to\nthe internal infrastructure it is very\nhard to guess uh and it is uh these\nrevisions are usually uh very\nshort-lived.\nuh and additionally uh we another uh\ngood practice is to create uh a staging\nenvironment which uh reproduces the\nproduction environment like 100% from\nall uh uh ways except for maybe uh the\naccess from some uh external uh from the\npublic uh locations and then you can uh\ncover this additional topic. Sorry for\nlike hijacking the stream from testing to\nthe security part but it is yes it is a\nkind of things that uh important to\nremember when you utilize shadow\nrevisions.\nOkay. Uh let's move forward. We are\nabout 20 minutes before the end of the\nbroadcast and we want to cover uh and\naddress uh questions that were asked. I\nsee we have quite a few questions. So\nlet's move forward with uh\n>> uh ju just one more note before before\nwe sort of um move to questions.
Um a\nlot of people ask like how do you how do\nyou come up with those you know test\nprompts whether you have reference for\nthem or not and that's like one of the\nthe most frequently asked questions and the\nanswer for this is that you effectively\nneed to collect your you know what we\ncall critical user journeys or happy\npaths or golden paths like the questions\nthat you really want\nuh to work with your agent and\nthe way you collect it like in there\nmultiple ways like obviously some of it\nyou know you just write them down uh\nsome of it you have your early testers\num you see what they ask the agent and\nwhat the answers they expect uh you use\nfeedback if you have a feedback loop you\nknow those like dislike things that you\ncan say oh the agent you know was wrong\nor I don't like the answer that's also\nhow you collect it and uh finally I mean\nyou can use another LLM to uh to create\nthose uh because that you know synthetic\ndata is always helpful in um in those um\nAI workflows.\n>> It reminds me a bit uh\nhow we explain uh uh discovering\ncritical user journeys with uh SRE, site\nreliability engineering.\n>> Yeah,\n>> we actually say the same thing. So you\nyou discover it from the previous data\nthat is already existing in the system.\nIf you don't have it, you go to your\nbusiness logic. There are some kind of\ndesigns or functional uh requirements\nthat uh get fulfilled or you just invent\nit in the case that nothing like this\nexist. And uh here we can use and\nutilize uh AI. I just wanted to add also\nthat uh you usually don't have to care\nabout different uh\nuh formats or different ways to define\nto ask the same uh request uh because uh\nas we mentioned uh the uh model uh\ngrader model based grader uh uses uh\nstrong AI model.
So defining a\nrelatively straightforward prompt will\ncover all other ways that people can\ncome with or another application can\ncome with uh send the same request\nbecause it analyzes the semantic\nmeaning uh as opposed to the just\nmatching or comparison.\n>> All right. uh we move to the questions\nand the first question is from Brian uh\nfrom LinkedIn. Uh the question is in a\nmulti-agent evaluation pipeline what\nparts of the agent process are treated\nas observable whether it's final\nresponse tool calls agent to agent\nmessages planner state memory reads and\nwrites routing decision or execution\ntraces that is a very good question\nuh so ultimately\nuh with frameworks like ADK And\nespecially on Google Cloud everything is\nobservable right so you you can collect\nthe intermediate steps um it's um\ndepending on how you call your sub\nagents\nit may be um harder or easier to\nretrieve but ultimately everything is\navailable everything is traced and uh\nit's better than um if you provide it to\nyour graders whether they are deterministic\nor not as a whole. However, uh there\nthere are always caveats, right? If if\nyou create like a proper sub agent with\nuh with uh with calling sub agent as as\ntools and not necessarily with\ndelegation, then whatever happened in\nthe sub agent may be hidden from you.\nThis is why it is important to to\nevaluate your agents at every\num sort of like at at the high highest\ngranularity possible. Right? So you and\nyou can see that we evaluate researchers\nresearcher individually and we should be\nevaluating every other agent as well.\nRight? So judge needs to be evaluated\ntoo, right? And then we evaluate system\nas a whole, right?
And if we have those\nsubsystems like sub agents that have\nother sub-agents uh you you need to\nevaluate those too just like with your\nsoftware right you have you have\ndifferent components and then you have\ncomponents integrated to bigger systems\nand then you have individual functions\nright if you want a good test coverage\nand the right test coverage in your uh\nsystem you do it at\nevery single level and that's\nthat's the best practice and that sort\nof saves you from things that you know\nhappen um to be hidden, right? Like we\nwe we don't always have the entire trace\num not necessarily saved like we usually\nhave it saved in the in the tracing and\nthis is something that uh you can see\nthis lab also demonstrates how you how\nyou collect traces along the way. uh but\num when you submit it to the evaluation\nservice not every single call may be\nvisible so that's why you you want to do\nit um at different levels of granularity\nanything that you want to add to it Leonid?\nuh no I I feel you cover it I uh\ndepending what\nthe question mean because I I feel there\nare a couple of uh interpretations.\nBut basically uh uh we want to go to the\nuh observable uh information for this\nparticular agent. So if this agent\ninvokes uh tools uh remotely or it\ninvokes uh other agents as part of the\nflow uh we don't want to uh to observe\nit as testing this particular agent\nsimply because again we are testing user\njourney for this agent.
uh but I\ncompletely agree in general we want each\nuh component each agent to have its own\nset of tests for its own uh critical\nuser checks.\nUm one more question um from LinkedIn.\nHow would you design a continuous\nevaluation pipeline for\na multi-agent Gemini system that can\nautomatically distinguish whether a\nproduction failure was caused by the LLM,\nagent orchestration,\num tool execution, retrieval or prompt\ndesign and continuously improve the\nsystem without introducing evaluation\nbias?\nUm that question has actually a lot of\nquestions in it. So I'll I'll I'll try\nto to give\nuh to do my best answering it. So the\nway you distinguish\nwhat\nintroduced the issue is that you need to\nlook at that interpretation. Remember\nhow we looked at the report and uh the\num the grader was essentially explaining\nlike what happened like why certain\npiece is not right. It's like writing\na rationale for for that part. So\nyou need to look at that part and maybe\neven use another LLM to classify what\nwhat component introduced the problem\nwhether the the tool returned something\nwrong or the tool information was right\nuh or sorry the tool returned what it\nwas asked for but the query was wrong.\nFor example, I I did an experiment\nyesterday. I I added the the\nto the prompt of the researcher a\nsentence, replace Rome with Berlin. Like\nwhatever result you get from Wikipedia,\njust replace it. Right? And uh the LLM\ngrader was able to say he's like, \"Oh,\nbecause the agent was actually\ninstructed to uh to\nuh replace all Rome with Berlin.\nuh the LLM ended up asking the the tool\nfor a wrong city, not the one that the\nuser asked for. Right? So that that sort\nof classification um they were able to u\nto pick and this is something that you\ncan also analyze and and effectively do\na like create a little classifier that\nsays what component uh introduced um you\nknow the problem. So that's I think the\nbest way.
So we keep we keep layering\nLLMs on top of other LLMs.\nuh but that's essentially the reality of\nthis non-deterministic world\nthat we rely on LLMs a lot and like uh\nwe mentioned previously the way you\nmitigate this um risk potential is that\nyou do it many times right you you just\nrun the same prompt multiple times and\nsee what essentially was produced and\nwhat was the reason something failed\nsometimes failures are um I wouldn't say\ndeterministic entirely. Some\nfailures are not LLM related. It's just\nlike oh you make you made a query to\ndatabase and there was a glitch and it\ncame with an empty answer. Those you\nshould be able to spot just like okay\ntool call failed because it came back\nempty.\nUh I just wanted to add also that uh I\nbelieve uh we can in many cases uh\nseparate uh the problem of evaluation\ndetecting the the problem uh due to the\nchanges from the uh analysis because I\nfeel that this question is more on\nanalysis why specific test fails and not\nuh to\ndetect the failure itself. So analysis\nwe will usually implement some uh\nto automate it again we can uh employ uh\nvarious agents that specialize on\nanalysis of the code changes and uh we\ncan uh stream the information from the\nevaluation\nuh that comes uh as far as I know and\nVlad correct me if I'm wrong it is not\nformalized currently there's no like\ncommon standard for evaluating uh models\nacross uh the companies across platforms\nuh and the agent platform uses their own\nformat for returning the results. But we\ncan obviously uh teach agent uh about\nthe format and provide it as additional\ninput and automate the full uh like\npipeline. However, it is a bit outside\nof the scope of the evaluating uh\n>> Yeah, you're right. It's like you you\nrun unit tests. Let's think about\ntraditional software.
They fail, right?\nAnd then as a developer, you kind of go\nback to this exploration mode,\nexperimentation mode, and and start\nanalyzing why did it fail, right? So,\nyou can have some tools that help you\nwith that, but that's not part of the\ndefense system. The the the job of the\ndefense system is to not let your uh bad\nagent to production.\nUh one more question in production\nmulti-agent system how do you recommend\nbalancing automated evaluation metrics\nlike um tool use quality trajectory\nprecision and hallucination detection with\nhuman review? At what point should a\nteam trust the evaluation pipeline\nenough to promote u a new agent revision\nto production? Um there's no like right\nanswer to this question. Um the question\nitself obviously states like a very\nimportant reality is that uh people need\nto be involved right and uh the more\npeople you have reviewing uh those\nresults uh the better and yeah we we\nshould have um people um looking at it.\nI think the most important part of where\npeople are needed is when you create\nthose golden data sets, right? Those\ntest question, those test questions with\nuh if you have references like with\nreference answers, you have to look\nwhether\num those answers actually valid for the\nquestion. This is where you can you can\nsee that the reference is actually right\nbecause I often like talking to\ndifferent you know customers of of of\nGoogle uh we often see that um that data\nis collected.
So there's the question and then the answer, and then it turns out that the answer is actually wrong or only partially right. Some human at some point came up with this answer, and now we use this reference answer to decide whether the LLM was right. But the reference turns out to be wrong, the test fails, and everyone is upset because they think that the agent is not working, when in fact the human wasn't right with the answer. That happens all the time, and in this case LLMs are actually able to catch the human mistake. How do we learn about it? We look at the failure and see what failed, and then a human reads it more carefully and realizes that the reference turns out to be wrong. Many customers actually use LLMs, again, to judge those failures. But again, that's not part of the defense system; we move back to experimentation mode. It is important, though. You can employ an LLM to explain why the answer is wrong, given the reference answer, and why it doesn't match, and then the LLM may tell you: oh, you know what, the reference answer is wrong. So that's how we do this.\nI think that's about all the time that we have. I'm sorry that we didn't answer all the questions that were asked. Feel free to reach out to us on LinkedIn or by other means. We'll be happy to answer more questions. Any final words, Leonid, that you want to share with people, any recommendations?\n>> The most important thing is to always evaluate AI agents. Don't believe that it will work just because you use a strong model or because you tested it manually at some point before production. The lab shows how to test as part of the continuous integration and deployment pipeline.\n
But you can also run some of this testing on production itself, the same way synthetic tests get run on production environments to track that things don't drift. Again, as we mentioned, agents utilize not only the model but also tools, and tools are based on real-time data, which changes. So your tests could be fine at deployment time, and in a week or in months, depending on your release cadence, the agent can behave strangely. It can drift.\nAnother thing, which I keep telling myself first and others as well: the fact that it is AI doesn't invalidate your knowledge about how you test things and how you build tests. You do have to build the test harness. You obviously will have to enhance the existing test harness, but it is the same methodology. It is the same logic. You will just have to learn a couple of new terms and new methods. Some of them we showed, and I hope it was helpful. If you have questions, please find us on Reddit, LinkedIn, or maybe Hacker News. We sometimes announce there as well.\n>> Yeah, thank you so much, Leonid, for being here with me today. I hope the audience liked what we presented and will be able to use it in their day-to-day development. Create your agents, always evaluate your agents, use Google Cloud to deploy them and run them at scale, and go try the lab. Deploy it yourself, evaluate, and see how it works for you.\n>> I forgot to mention: next month we will be broadcasting live another episode that will be focused on protection of the models.\n
So please uh keep tracking\nthe announcement uh and join us next\nmonth.\n>> Thank you.\n>> Thank you so much and have fun.\nHey,\nhey,\nhey.","duration_seconds":5090,"is_auto_generated":true,"transcript_source":"youtube_transcript_api","transcript_status":"available","asr_quality_weight":0.6,"transcript_segments":[{"text":"[music]","start":36.68,"duration":2.02},{"text":"[music]","start":43.23,"duration":2.02},{"text":"We are looking at a 5,000 share order.","start":46.719,"duration":4.921},{"text":"[music]","start":52.4,"duration":2.02},{"text":"Hello and welcome back to another","start":77.439,"duration":5.201},{"text":"episode of Google Cloud Live. I'm Vlad","start":79.84,"duration":5.279},{"text":"Kallesnika, developer relations engineer","start":82.64,"duration":5.76},{"text":"at Google Cloud and today I'm here with","start":85.119,"duration":5.04},{"text":"Leonid.","start":88.4,"duration":3.679},{"text":"Leonid.","start":90.159,"duration":5.521},{"text":">> Hi. Hello Vlad. Uh thank you. Uh I'm","start":92.079,"duration":5.601},{"text":"Leonid. I am a developer advocate at","start":95.68,"duration":4.56},{"text":"Google Cloud working with Vlad uh in the","start":97.68,"duration":7.2},{"text":"same team. Uh I specialize on uh AI","start":100.24,"duration":8.08},{"text":"workload uh security and observability.","start":104.88,"duration":6.239},{"text":"But today I will be talking to you about","start":108.32,"duration":6.72},{"text":"uh data and agent evaluation.","start":111.119,"duration":6.481},{"text":">> Agent evaluation you know we say from","start":115.04,"duration":6.079},{"text":"VIP checks to datadriven evaluations. Uh","start":117.6,"duration":5.519},{"text":"what do you think VIP checks are? How","start":121.119,"duration":4.801},{"text":"would you describe them?","start":123.119,"duration":5.601},{"text":">> Thank you. Uh and let me switch to the","start":125.92,"duration":6.8},{"text":"slides for a moment. 
uh so when we uh","start":128.72,"duration":6.96},{"text":"speak about uh V evaluation","start":132.72,"duration":6.56},{"text":"uh or before that uh I want to address","start":135.68,"duration":5.919},{"text":"is a question why do we need to evaluate","start":139.28,"duration":4.8},{"text":"AI agents in the first place today","start":141.599,"duration":5.521},{"text":"perception is that AI is smart enough uh","start":144.08,"duration":7.36},{"text":"to do the work and uh so why do we need","start":147.12,"duration":10.479},{"text":"uh to test AI uh the main problem uh is","start":151.44,"duration":11.36},{"text":"that uh Today people don't believe AI","start":157.599,"duration":7.92},{"text":"like they did in the beginning when it","start":162.8,"duration":6.0},{"text":"just came. Uh I believe that our","start":165.519,"duration":7.761},{"text":"audience also don't think that uh AI is","start":168.8,"duration":7.519},{"text":"perfect and never makes mistakes.","start":173.28,"duration":5.84},{"text":"So uh when we speak about wipe check we","start":176.319,"duration":5.521},{"text":"usually speak about uh testings that we","start":179.12,"duration":7.36},{"text":"do manually. We run uh our application,","start":181.84,"duration":8.96},{"text":"our agent. We ask a few questions uh by","start":186.48,"duration":6.32},{"text":"manually typing or feeding the","start":190.8,"duration":5.519},{"text":"information to the agent. 
And uh then","start":192.8,"duration":6.4},{"text":"depending what kind of answers we get.","start":196.319,"duration":4.801},{"text":"If it is good, we assume that everything","start":199.2,"duration":5.44},{"text":"is works and if they are bad, we return","start":201.12,"duration":7.52},{"text":"back to the uh code or whatever","start":204.64,"duration":6.159},{"text":"configuration of the agent uh we work","start":208.64,"duration":5.519},{"text":"with and we try to adjust some uh","start":210.799,"duration":5.36},{"text":"settings, configurations, prompts and so","start":214.159,"duration":5.201},{"text":"on. So basically, you know, it it's kind","start":216.159,"duration":7.041},{"text":"of similar to testing um normal software","start":219.36,"duration":7.2},{"text":"and you know, if if you just write a a","start":223.2,"duration":5.599},{"text":"regular deterministic software, you can","start":226.56,"duration":4.239},{"text":"make a change and then you can just like","start":228.799,"duration":4.72},{"text":"run your application or your service,","start":230.799,"duration":5.201},{"text":"try a few things like your happy paths,","start":233.519,"duration":4.8},{"text":"see that they work and you just like","start":236.0,"duration":4.72},{"text":"push it to production. And we know that","start":238.319,"duration":6.48},{"text":"it often breaks with AI, I guess, uh at","start":240.72,"duration":6.719},{"text":"generative AI these days. uh it's even","start":244.799,"duration":6.961},{"text":"worse because the AI is like generative","start":247.439,"duration":7.44},{"text":"AI is fundamentally not deterministic.","start":251.76,"duration":6.8},{"text":"So even when we try something uh and it","start":254.879,"duration":6.32},{"text":"works at the moment that may not work uh","start":258.56,"duration":4.24},{"text":"next time. So kind of","start":261.199,"duration":2.56},{"text":">> it is true.","start":262.8,"duration":3.52},{"text":">> Yeah. 
So basically the importance of","start":263.759,"duration":6.801},{"text":"testing uh or checking um it increases","start":266.32,"duration":6.8},{"text":"and why why don't we just like test it","start":270.56,"duration":4.48},{"text":"manually right like like we do with the","start":273.12,"duration":3.84},{"text":"with the regular software remember like","start":275.04,"duration":4.8},{"text":"in the past we had all those testers and","start":276.96,"duration":7.12},{"text":"and then uh we end up um having","start":279.84,"duration":6.16},{"text":"automated tests and then uh people","start":284.08,"duration":4.88},{"text":"started talking about unit tests um so","start":286.0,"duration":4.639},{"text":"that's kind of you know that this is all","start":288.96,"duration":5.12},{"text":"applies u to traditional software What","start":290.639,"duration":5.84},{"text":"about AI these days?","start":294.08,"duration":5.28},{"text":">> So, uh I have news for you. We still","start":296.479,"duration":5.521},{"text":"have this people.","start":299.36,"duration":5.2},{"text":"We still have K departments that do uh a","start":302.0,"duration":6.72},{"text":"lot of uh tests manually like uh running","start":304.56,"duration":6.72},{"text":"by themselves. Uh and to be honest, as a","start":308.72,"duration":5.199},{"text":"developer advocate, I often do this kind","start":311.28,"duration":5.68},{"text":"of uh test by running user journeys on","start":313.919,"duration":6.0},{"text":"our products that yet to be released.","start":316.96,"duration":5.84},{"text":"However, you are completely right. 
Uh","start":319.919,"duration":5.84},{"text":"the main reason that uh manual testing","start":322.8,"duration":6.0},{"text":"doesn't work well is actually twofold.","start":325.759,"duration":8.241},{"text":"Uh first and uh it is very trivial uh to","start":328.8,"duration":9.36},{"text":"test all uh possible scenarios that we","start":334.0,"duration":8.32},{"text":"know or expect to work like golden pass","start":338.16,"duration":7.52},{"text":"or critical user journeys uh isn't","start":342.32,"duration":6.56},{"text":"effective when we speak about uh uh","start":345.68,"duration":6.48},{"text":"application or service at scale because","start":348.88,"duration":6.08},{"text":"uh there are too many and each time when","start":352.16,"duration":6.8},{"text":"we introduce a change or like submit uh","start":354.96,"duration":6.604},{"text":"a PR request or something like this","start":358.96,"duration":4.56},{"text":"[clears throat] to run it uh by people","start":361.564,"duration":5.316},{"text":"it is just very very slow. Another","start":363.52,"duration":5.84},{"text":"reason for this is that uh when people","start":366.88,"duration":5.84},{"text":"run it uh again I don't know your","start":369.36,"duration":7.2},{"text":"experience with QA teams. Uh in my uh","start":372.72,"duration":7.52},{"text":"previous uh workplaces QA teams were uh","start":376.56,"duration":8.88},{"text":"filing some uh test reports besides the","start":380.24,"duration":8.72},{"text":"bug that uh they found describing what","start":385.44,"duration":6.319},{"text":"they actually test. They had test plan","start":388.96,"duration":5.44},{"text":"documents that they checked. I did this","start":391.759,"duration":6.0},{"text":"test. I did that test. 
Uh but in many","start":394.4,"duration":6.32},{"text":"cases still a lot of uh stuff remains","start":397.759,"duration":6.081},{"text":"behind the scene and so all this test","start":400.72,"duration":6.64},{"text":"sessions uh kind of ephemeral. Uh you","start":403.84,"duration":7.28},{"text":"can't like track what actually was done.","start":407.36,"duration":7.04},{"text":"Uh you can't uh compare against previous","start":411.12,"duration":6.639},{"text":"executions and success. And this is why","start":414.4,"duration":8.32},{"text":"we actually moved uh from uh uh like uh","start":417.759,"duration":8.401},{"text":"focus on manual test by the teams to","start":422.72,"duration":5.599},{"text":"more automated approach.","start":426.16,"duration":6.319},{"text":">> I see. I see. So um now if we look at","start":428.319,"duration":7.921},{"text":"this um testing software that that we do","start":432.479,"duration":7.521},{"text":"or evaluation right uh with AI and","start":436.24,"duration":5.679},{"text":"software is similar but with with","start":440.0,"duration":4.0},{"text":"generative AI it it's kind of a common","start":441.919,"duration":4.161},{"text":"thing and it comes from traditional","start":444.0,"duration":6.96},{"text":"machine learning uh we do evaluate stuff","start":446.08,"duration":7.04},{"text":"um during development right during the","start":450.96,"duration":5.92},{"text":"experimentation stage and uh we we","start":453.12,"duration":5.44},{"text":"change our prompts and then we see","start":456.88,"duration":2.719},{"text":"whether","start":458.56,"duration":4.4},{"text":"you know agents behave better or worse.","start":459.599,"duration":6.481},{"text":"Um so now um what's the difference","start":462.96,"duration":6.72},{"text":"between this experimentation and really","start":466.08,"duration":7.04},{"text":"the the evaluation in uh in the form of","start":469.68,"duration":5.359},{"text":"you know where traditional unit 
tests","start":473.12,"duration":4.0},{"text":"are used.","start":475.039,"duration":10.0},{"text":"So uh as uh we said in general the","start":477.12,"duration":10.479},{"text":"manual testing especially for AI it is","start":485.039,"duration":7.28},{"text":"uh much more I would say uh relevant","start":487.599,"duration":8.081},{"text":"uh manual testing uh for AI phase when","start":492.319,"duration":7.201},{"text":"we try to tweak uh our configuration for","start":495.68,"duration":7.359},{"text":"AI we don't actually code AI right we","start":499.52,"duration":6.799},{"text":"don't uh modify code for the model in","start":503.039,"duration":3.6},{"text":"general","start":506.319,"duration":2.481},{"text":">> we still can we still can right","start":506.639,"duration":4.481},{"text":">> we can but uh when I speak about","start":508.8,"duration":4.719},{"text":"software developers not data engineers","start":511.12,"duration":6.159},{"text":"not AI science scientists","start":513.519,"duration":7.76},{"text":"uh we uh adjust some configuration like","start":517.279,"duration":5.76},{"text":"uh there is a parameter called","start":521.279,"duration":5.761},{"text":"temperature which describes how much we","start":523.039,"duration":8.8},{"text":"allow to uh the model to uh experiment","start":527.04,"duration":9.52},{"text":"or go to uh some discovery interface uh","start":531.839,"duration":7.44},{"text":"and there are other things like uh","start":536.56,"duration":5.36},{"text":"system instructions, prompts, uh","start":539.279,"duration":5.041},{"text":"additional uh information that we can uh","start":541.92,"duration":6.32},{"text":"plug to uh the system or some logic uh","start":544.32,"duration":7.6},{"text":"provided with the agent configuration.","start":548.24,"duration":6.88},{"text":"uh so this is kind of uh the phase that","start":551.92,"duration":7.28},{"text":"we uh can call discovery when we 
uh","start":555.12,"duration":8.08},{"text":"experiment and uh it is hard to uh","start":559.2,"duration":5.52},{"text":"algorithmize","start":563.2,"duration":6.079},{"text":"uh experiments. So it is naturally uh","start":564.72,"duration":8.08},{"text":"have a creative and open-ending uh natur","start":569.279,"duration":8.321},{"text":"uh approach which uh makes sense to do","start":572.8,"duration":8.08},{"text":"manually uh to have a a person uh doing","start":577.6,"duration":7.76},{"text":"it versus the agent. uh on another side","start":580.88,"duration":7.84},{"text":"we have uh our CI/CD process that have","start":585.36,"duration":7.919},{"text":"to be rigid and repetitive and this","start":588.72,"duration":5.6},{"text":"process","start":593.279,"duration":3.521},{"text":">> yeah ju just for for the audience like","start":594.32,"duration":4.16},{"text":"when we say CI/CD we mean like","start":596.8,"duration":3.28},{"text":"continuous integration","start":598.48,"duration":3.44},{"text":">> and continuous deployment as far as I","start":600.08,"duration":2.879},{"text":"remember right","start":601.92,"duration":3.039},{"text":">> and this is where","start":602.959,"duration":5.041},{"text":">> yeah this is where effectively our uh","start":604.959,"duration":5.521},{"text":"software in this case our agent is sort","start":608.0,"duration":5.519},{"text":"of like done or almost done right in and","start":610.48,"duration":5.599},{"text":"it's in the process of final tweaks or","start":613.519,"duration":4.401},{"text":"it's in the process of being uh","start":616.079,"duration":4.081},{"text":"continuously improved because the you","start":617.92,"duration":3.919},{"text":"know this is what we do we continuously","start":620.16,"duration":4.32},{"text":"improve our software we test new models","start":621.839,"duration":7.041},{"text":"we do little tweaks uh with our prompts","start":624.48,"duration":7.76},{"text":"uh or systems instruction etc etc 
right","start":628.88,"duration":5.12},{"text":"so I","start":632.24,"duration":5.44},{"text":">> I would say yes and no actually because","start":634.0,"duration":8.48},{"text":"uh when we say agent uh today uh this uh","start":637.68,"duration":8.08},{"text":"has certain ambiguity because the engine","start":642.48,"duration":5.2},{"text":"includes some interaction with the","start":645.76,"duration":4.72},{"text":"model. I can simply change the model I","start":647.68,"duration":5.599},{"text":"use uh and start using a different model","start":650.48,"duration":5.52},{"text":"which can completely change uh the","start":653.279,"duration":5.201},{"text":"behavior of the overall application. But","start":656.0,"duration":7.519},{"text":"I can also go and uh add uh some uh","start":658.48,"duration":8.08},{"text":"tools uh some instrument in instrument","start":663.519,"duration":7.841},{"text":"that model can invoke through uh some uh","start":666.56,"duration":8.08},{"text":"thinking process. Uh the model can uh be","start":671.36,"duration":6.4},{"text":"made aware by the agent that it can uh","start":674.64,"duration":5.92},{"text":"use some tools and these tools","start":677.76,"duration":5.519},{"text":"essentially implemented as uh simple","start":680.56,"duration":6.0}],"view_count_at_fetch":2706,"transcript_available":true,"transcript_last_attempted_at":"2026-07-01T06:08:13.947053+00:00"},"tags_json":[],"is_discovered_source":false},"transcript":{"segment_count":200,"markdown":"# Transcript\n\n## Segment 1\n\n**Speaker:** Unknown speaker\n\n[music]\n\n## Segment 2\n\n**Speaker:** Unknown speaker\n\n[music]\n\n## Segment 3\n\n**Speaker:** Unknown speaker\n\nWe are looking at a 5,000 share order.\n\n## Segment 4\n\n**Speaker:** Unknown speaker\n\n[music]\n\n## Segment 5\n\n**Speaker:** Unknown speaker\n\nHello and welcome back to another\n\n## Segment 6\n\n**Speaker:** Unknown speaker\n\nepisode of Google Cloud Live. 
I'm Vlad\n\n## Segment 7\n\n**Speaker:** Unknown speaker\n\nKallesnika, developer relations engineer\n\n## Segment 8\n\n**Speaker:** Unknown speaker\n\nat Google Cloud and today I'm here with\n\n## Segment 9\n\n**Speaker:** Unknown speaker\n\nLeonid.\n\n## Segment 10\n\n**Speaker:** Unknown speaker\n\nLeonid.\n\n## Segment 11\n\n**Speaker:** Unknown speaker\n\n>> Hi. Hello Vlad. Uh thank you. Uh I'm\n\n## Segment 12\n\n**Speaker:** Unknown speaker\n\nLeonid. I am a developer advocate at\n\n## Segment 13\n\n**Speaker:** Unknown speaker\n\nGoogle Cloud working with Vlad uh in the\n\n## Segment 14\n\n**Speaker:** Unknown speaker\n\nsame team. Uh I specialize on uh AI\n\n## Segment 15\n\n**Speaker:** Unknown speaker\n\nworkload uh security and observability.\n\n## Segment 16\n\n**Speaker:** Unknown speaker\n\nBut today I will be talking to you about\n\n## Segment 17\n\n**Speaker:** Unknown speaker\n\nuh data and agent evaluation.\n\n## Segment 18\n\n**Speaker:** Unknown speaker\n\n>> Agent evaluation you know we say from\n\n## Segment 19\n\n**Speaker:** Unknown speaker\n\nVIP checks to datadriven evaluations. Uh\n\n## Segment 20\n\n**Speaker:** Unknown speaker\n\nwhat do you think VIP checks are? How\n\n## Segment 21\n\n**Speaker:** Unknown speaker\n\nwould you describe them?\n\n## Segment 22\n\n**Speaker:** Unknown speaker\n\n>> Thank you. Uh and let me switch to the\n\n## Segment 23\n\n**Speaker:** Unknown speaker\n\nslides for a moment. 
uh so when we uh\n\n## Segment 24\n\n**Speaker:** Unknown speaker\n\nspeak about uh V evaluation\n\n## Segment 25\n\n**Speaker:** Unknown speaker\n\nuh or before that uh I want to address\n\n## Segment 26\n\n**Speaker:** Unknown speaker\n\nis a question why do we need to evaluate\n\n## Segment 27\n\n**Speaker:** Unknown speaker\n\nAI agents in the first place today\n\n## Segment 28\n\n**Speaker:** Unknown speaker\n\nperception is that AI is smart enough uh\n\n## Segment 29\n\n**Speaker:** Unknown speaker\n\nto do the work and uh so why do we need\n\n## Segment 30\n\n**Speaker:** Unknown speaker\n\nuh to test AI uh the main problem uh is\n\n## Segment 31\n\n**Speaker:** Unknown speaker\n\nthat uh Today people don't believe AI\n\n## Segment 32\n\n**Speaker:** Unknown speaker\n\nlike they did in the beginning when it\n\n## Segment 33\n\n**Speaker:** Unknown speaker\n\njust came. Uh I believe that our\n\n## Segment 34\n\n**Speaker:** Unknown speaker\n\naudience also don't think that uh AI is\n\n## Segment 35\n\n**Speaker:** Unknown speaker\n\nperfect and never makes mistakes.\n\n## Segment 36\n\n**Speaker:** Unknown speaker\n\nSo uh when we speak about wipe check we\n\n## Segment 37\n\n**Speaker:** Unknown speaker\n\nusually speak about uh testings that we\n\n## Segment 38\n\n**Speaker:** Unknown speaker\n\ndo manually. We run uh our application,\n\n## Segment 39\n\n**Speaker:** Unknown speaker\n\nour agent. We ask a few questions uh by\n\n## Segment 40\n\n**Speaker:** Unknown speaker\n\nmanually typing or feeding the\n\n## Segment 41\n\n**Speaker:** Unknown speaker\n\ninformation to the agent. 
And uh then\n\n## Segment 42\n\n**Speaker:** Unknown speaker\n\ndepending what kind of answers we get.\n\n## Segment 43\n\n**Speaker:** Unknown speaker\n\nIf it is good, we assume that everything\n\n## Segment 44\n\n**Speaker:** Unknown speaker\n\nis works and if they are bad, we return\n\n## Segment 45\n\n**Speaker:** Unknown speaker\n\nback to the uh code or whatever\n\n## Segment 46\n\n**Speaker:** Unknown speaker\n\nconfiguration of the agent uh we work\n\n## Segment 47\n\n**Speaker:** Unknown speaker\n\nwith and we try to adjust some uh\n\n## Segment 48\n\n**Speaker:** Unknown speaker\n\nsettings, configurations, prompts and so\n\n## Segment 49\n\n**Speaker:** Unknown speaker\n\non. So basically, you know, it it's kind\n\n## Segment 50\n\n**Speaker:** Unknown speaker\n\nof similar to testing um normal software\n\n## Segment 51\n\n**Speaker:** Unknown speaker\n\nand you know, if if you just write a a\n\n## Segment 52\n\n**Speaker:** Unknown speaker\n\nregular deterministic software, you can\n\n## Segment 53\n\n**Speaker:** Unknown speaker\n\nmake a change and then you can just like\n\n## Segment 54\n\n**Speaker:** Unknown speaker\n\nrun your application or your service,\n\n## Segment 55\n\n**Speaker:** Unknown speaker\n\ntry a few things like your happy paths,\n\n## Segment 56\n\n**Speaker:** Unknown speaker\n\nsee that they work and you just like\n\n## Segment 57\n\n**Speaker:** Unknown speaker\n\npush it to production. And we know that\n\n## Segment 58\n\n**Speaker:** Unknown speaker\n\nit often breaks with AI, I guess, uh at\n\n## Segment 59\n\n**Speaker:** Unknown speaker\n\ngenerative AI these days. 
uh it's even\n\n## Segment 60\n\n**Speaker:** Unknown speaker\n\nworse because the AI is like generative\n\n## Segment 61\n\n**Speaker:** Unknown speaker\n\nAI is fundamentally not deterministic.\n\n## Segment 62\n\n**Speaker:** Unknown speaker\n\nSo even when we try something uh and it\n\n## Segment 63\n\n**Speaker:** Unknown speaker\n\nworks at the moment that may not work uh\n\n## Segment 64\n\n**Speaker:** Unknown speaker\n\nnext time. So kind of\n\n## Segment 65\n\n**Speaker:** Unknown speaker\n\n>> it is true.\n\n## Segment 66\n\n**Speaker:** Unknown speaker\n\n>> Yeah. So basically the importance of\n\n## Segment 67\n\n**Speaker:** Unknown speaker\n\ntesting uh or checking um it increases\n\n## Segment 68\n\n**Speaker:** Unknown speaker\n\nand why why don't we just like test it\n\n## Segment 69\n\n**Speaker:** Unknown speaker\n\nmanually right like like we do with the\n\n## Segment 70\n\n**Speaker:** Unknown speaker\n\nwith the regular software remember like\n\n## Segment 71\n\n**Speaker:** Unknown speaker\n\nin the past we had all those testers and\n\n## Segment 72\n\n**Speaker:** Unknown speaker\n\nand then uh we end up um having\n\n## Segment 73\n\n**Speaker:** Unknown speaker\n\nautomated tests and then uh people\n\n## Segment 74\n\n**Speaker:** Unknown speaker\n\nstarted talking about unit tests um so\n\n## Segment 75\n\n**Speaker:** Unknown speaker\n\nthat's kind of you know that this is all\n\n## Segment 76\n\n**Speaker:** Unknown speaker\n\napplies u to traditional software What\n\n## Segment 77\n\n**Speaker:** Unknown speaker\n\nabout AI these days?\n\n## Segment 78\n\n**Speaker:** Unknown speaker\n\n>> So, uh I have news for you. 
We still\n\n## Segment 79\n\n**Speaker:** Unknown speaker\n\nhave this people.\n\n## Segment 80\n\n**Speaker:** Unknown speaker\n\nWe still have K departments that do uh a\n\n## Segment 81\n\n**Speaker:** Unknown speaker\n\nlot of uh tests manually like uh running\n\n## Segment 82\n\n**Speaker:** Unknown speaker\n\nby themselves. Uh and to be honest, as a\n\n## Segment 83\n\n**Speaker:** Unknown speaker\n\ndeveloper advocate, I often do this kind\n\n## Segment 84\n\n**Speaker:** Unknown speaker\n\nof uh test by running user journeys on\n\n## Segment 85\n\n**Speaker:** Unknown speaker\n\nour products that yet to be released.\n\n## Segment 86\n\n**Speaker:** Unknown speaker\n\nHowever, you are completely right. Uh\n\n## Segment 87\n\n**Speaker:** Unknown speaker\n\nthe main reason that uh manual testing\n\n## Segment 88\n\n**Speaker:** Unknown speaker\n\ndoesn't work well is actually twofold.\n\n## Segment 89\n\n**Speaker:** Unknown speaker\n\nUh first and uh it is very trivial uh to\n\n## Segment 90\n\n**Speaker:** Unknown speaker\n\ntest all uh possible scenarios that we\n\n## Segment 91\n\n**Speaker:** Unknown speaker\n\nknow or expect to work like golden pass\n\n## Segment 92\n\n**Speaker:** Unknown speaker\n\nor critical user journeys uh isn't\n\n## Segment 93\n\n**Speaker:** Unknown speaker\n\neffective when we speak about uh uh\n\n## Segment 94\n\n**Speaker:** Unknown speaker\n\napplication or service at scale because\n\n## Segment 95\n\n**Speaker:** Unknown speaker\n\nuh there are too many and each time when\n\n## Segment 96\n\n**Speaker:** Unknown speaker\n\nwe introduce a change or like submit uh\n\n## Segment 97\n\n**Speaker:** Unknown speaker\n\na PR request or something like this\n\n## Segment 98\n\n**Speaker:** Unknown speaker\n\n[clears throat] to run it uh by people\n\n## Segment 99\n\n**Speaker:** Unknown speaker\n\nit is just very very slow. 
Another\n\n## Segment 100\n\n**Speaker:** Unknown speaker\n\nreason for this is that uh when people\n\n## Segment 101\n\n**Speaker:** Unknown speaker\n\nrun it uh again I don't know your\n\n## Segment 102\n\n**Speaker:** Unknown speaker\n\nexperience with QA teams. Uh in my uh\n\n## Segment 103\n\n**Speaker:** Unknown speaker\n\nprevious uh workplaces QA teams were uh\n\n## Segment 104\n\n**Speaker:** Unknown speaker\n\nfiling some uh test reports besides the\n\n## Segment 105\n\n**Speaker:** Unknown speaker\n\nbug that uh they found describing what\n\n## Segment 106\n\n**Speaker:** Unknown speaker\n\nthey actually test. They had test plan\n\n## Segment 107\n\n**Speaker:** Unknown speaker\n\ndocuments that they checked. I did this\n\n## Segment 108\n\n**Speaker:** Unknown speaker\n\ntest. I did that test. Uh but in many\n\n## Segment 109\n\n**Speaker:** Unknown speaker\n\ncases still a lot of uh stuff remains\n\n## Segment 110\n\n**Speaker:** Unknown speaker\n\nbehind the scene and so all this test\n\n## Segment 111\n\n**Speaker:** Unknown speaker\n\nsessions uh kind of ephemeral. Uh you\n\n## Segment 112\n\n**Speaker:** Unknown speaker\n\ncan't like track what actually was done.\n\n## Segment 113\n\n**Speaker:** Unknown speaker\n\nUh you can't uh compare against previous\n\n## Segment 114\n\n**Speaker:** Unknown speaker\n\nexecutions and success. And this is why\n\n## Segment 115\n\n**Speaker:** Unknown speaker\n\nwe actually moved uh from uh uh like uh\n\n## Segment 116\n\n**Speaker:** Unknown speaker\n\nfocus on manual test by the teams to\n\n## Segment 117\n\n**Speaker:** Unknown speaker\n\nmore automated approach.\n\n## Segment 118\n\n**Speaker:** Unknown speaker\n\n>> I see. I see. 
So um now if we look at\n\n## Segment 119\n\n**Speaker:** Unknown speaker\n\nthis um testing software that that we do\n\n## Segment 120\n\n**Speaker:** Unknown speaker\n\nor evaluation right uh with AI and\n\n## Segment 121\n\n**Speaker:** Unknown speaker\n\nsoftware is similar but with with\n\n## Segment 122\n\n**Speaker:** Unknown speaker\n\ngenerative AI it it's kind of a common\n\n## Segment 123\n\n**Speaker:** Unknown speaker\n\nthing and it comes from traditional\n\n## Segment 124\n\n**Speaker:** Unknown speaker\n\nmachine learning uh we do evaluate stuff\n\n## Segment 125\n\n**Speaker:** Unknown speaker\n\num during development right during the\n\n## Segment 126\n\n**Speaker:** Unknown speaker\n\nexperimentation stage and uh we we\n\n## Segment 127\n\n**Speaker:** Unknown speaker\n\nchange our prompts and then we see\n\n## Segment 128\n\n**Speaker:** Unknown speaker\n\nwhether\n\n## Segment 129\n\n**Speaker:** Unknown speaker\n\nyou know agents behave better or worse.\n\n## Segment 130\n\n**Speaker:** Unknown speaker\n\nUm so now um what's the difference\n\n## Segment 131\n\n**Speaker:** Unknown speaker\n\nbetween this experimentation and really\n\n## Segment 132\n\n**Speaker:** Unknown speaker\n\nthe the evaluation in uh in the form of\n\n## Segment 133\n\n**Speaker:** Unknown speaker\n\nyou know where traditional unit tests\n\n## Segment 134\n\n**Speaker:** Unknown speaker\n\nare used.\n\n## Segment 135\n\n**Speaker:** Unknown speaker\n\nSo uh as uh we said in general the\n\n## Segment 136\n\n**Speaker:** Unknown speaker\n\nmanual testing especially for AI it is\n\n## Segment 137\n\n**Speaker:** Unknown speaker\n\nuh much more I would say uh relevant\n\n## Segment 138\n\n**Speaker:** Unknown speaker\n\nuh manual testing uh for AI phase when\n\n## Segment 139\n\n**Speaker:** Unknown speaker\n\nwe try to tweak uh our configuration for\n\n## Segment 140\n\n**Speaker:** Unknown speaker\n\nAI we don't actually code AI right we\n\n## Segment 141\n\n**Speaker:** 
Unknown speaker\n\ndon't uh modify code for the model in\n\n## Segment 142\n\n**Speaker:** Unknown speaker\n\ngeneral\n\n## Segment 143\n\n**Speaker:** Unknown speaker\n\n>> we still can we still can right\n\n## Segment 144\n\n**Speaker:** Unknown speaker\n\n>> we can but uh when I speak about\n\n## Segment 145\n\n**Speaker:** Unknown speaker\n\nsoftware developers not data engineers\n\n## Segment 146\n\n**Speaker:** Unknown speaker\n\nnot AI science scientists\n\n## Segment 147\n\n**Speaker:** Unknown speaker\n\nuh we uh adjust some configuration like\n\n## Segment 148\n\n**Speaker:** Unknown speaker\n\nuh there is a parameter called\n\n## Segment 149\n\n**Speaker:** Unknown speaker\n\ntemperature which describes how much we\n\n## Segment 150\n\n**Speaker:** Unknown speaker\n\nallow to uh the model to uh experiment\n\n## Segment 151\n\n**Speaker:** Unknown speaker\n\nor go to uh some discovery interface uh\n\n## Segment 152\n\n**Speaker:** Unknown speaker\n\nand there are other things like uh\n\n## Segment 153\n\n**Speaker:** Unknown speaker\n\nsystem instructions, prompts, uh\n\n## Segment 154\n\n**Speaker:** Unknown speaker\n\nadditional uh information that we can uh\n\n## Segment 155\n\n**Speaker:** Unknown speaker\n\nplug to uh the system or some logic uh\n\n## Segment 156\n\n**Speaker:** Unknown speaker\n\nprovided with the agent configuration.\n\n## Segment 157\n\n**Speaker:** Unknown speaker\n\nuh so this is kind of uh the phase that\n\n## Segment 158\n\n**Speaker:** Unknown speaker\n\nwe uh can call discovery when we uh\n\n## Segment 159\n\n**Speaker:** Unknown speaker\n\nexperiment and uh it is hard to uh\n\n## Segment 160\n\n**Speaker:** Unknown speaker\n\nalgorithmize\n\n## Segment 161\n\n**Speaker:** Unknown speaker\n\nuh experiments. 
So it is naturally uh\n\n## Segment 162\n\n**Speaker:** Unknown speaker\n\nhave a creative and open-ending uh natur\n\n## Segment 163\n\n**Speaker:** Unknown speaker\n\nuh approach which uh makes sense to do\n\n## Segment 164\n\n**Speaker:** Unknown speaker\n\nmanually uh to have a a person uh doing\n\n## Segment 165\n\n**Speaker:** Unknown speaker\n\nit versus the agent. uh on another side\n\n## Segment 166\n\n**Speaker:** Unknown speaker\n\nwe have uh our CI/CD process that have\n\n## Segment 167\n\n**Speaker:** Unknown speaker\n\nto be rigid and repetitive and this\n\n## Segment 168\n\n**Speaker:** Unknown speaker\n\nprocess\n\n## Segment 169\n\n**Speaker:** Unknown speaker\n\n>> yeah ju just for for the audience like\n\n## Segment 170\n\n**Speaker:** Unknown speaker\n\nwhen we say CI/CD we mean like\n\n## Segment 171\n\n**Speaker:** Unknown speaker\n\ncontinuous integration\n\n## Segment 172\n\n**Speaker:** Unknown speaker\n\n>> and continuous deployment as far as I\n\n## Segment 173\n\n**Speaker:** Unknown speaker\n\nremember right\n\n## Segment 174\n\n**Speaker:** Unknown speaker\n\n>> and this is where\n\n## Segment 175\n\n**Speaker:** Unknown speaker\n\n>> yeah this is where effectively our uh\n\n## Segment 176\n\n**Speaker:** Unknown speaker\n\nsoftware in this case our agent is sort\n\n## Segment 177\n\n**Speaker:** Unknown speaker\n\nof like done or almost done right in and\n\n## Segment 178\n\n**Speaker:** Unknown speaker\n\nit's in the process of final tweaks or\n\n## Segment 179\n\n**Speaker:** Unknown speaker\n\nit's in the process of being uh\n\n## Segment 180\n\n**Speaker:** Unknown speaker\n\ncontinuously improved because the you\n\n## Segment 181\n\n**Speaker:** Unknown speaker\n\nknow this is what we do we continuously\n\n## Segment 182\n\n**Speaker:** Unknown speaker\n\nimprove our software we test new models\n\n## Segment 183\n\n**Speaker:** Unknown speaker\n\nwe do little tweaks uh with our prompts\n\n## Segment 184\n\n**Speaker:** Unknown 
speaker\n\nuh or systems instruction etc etc right\n\n## Segment 185\n\n**Speaker:** Unknown speaker\n\nso I\n\n## Segment 186\n\n**Speaker:** Unknown speaker\n\n>> I would say yes and no actually because\n\n## Segment 187\n\n**Speaker:** Unknown speaker\n\nuh when we say agent uh today uh this uh\n\n## Segment 188\n\n**Speaker:** Unknown speaker\n\nhas certain ambiguity because the engine\n\n## Segment 189\n\n**Speaker:** Unknown speaker\n\nincludes some interaction with the\n\n## Segment 190\n\n**Speaker:** Unknown speaker\n\nmodel. I can simply change the model I\n\n## Segment 191\n\n**Speaker:** Unknown speaker\n\nuse uh and start using a different model\n\n## Segment 192\n\n**Speaker:** Unknown speaker\n\nwhich can completely change uh the\n\n## Segment 193\n\n**Speaker:** Unknown speaker\n\nbehavior of the overall application. But\n\n## Segment 194\n\n**Speaker:** Unknown speaker\n\nI can also go and uh add uh some uh\n\n## Segment 195\n\n**Speaker:** Unknown speaker\n\ntools uh some instrument in instrument\n\n## Segment 196\n\n**Speaker:** Unknown speaker\n\nthat model can invoke through uh some uh\n\n## Segment 197\n\n**Speaker:** Unknown speaker\n\nthinking process. Uh the model can uh be\n\n## Segment 198\n\n**Speaker:** Unknown speaker\n\nmade aware by the agent that it can uh\n\n## Segment 199\n\n**Speaker:** Unknown speaker\n\nuse some tools and these tools\n\n## Segment 200\n\n**Speaker:** Unknown speaker\n\nessentially implemented as uh simple","text":"[segment 0] Unknown speaker: [music]\n[segment 1] Unknown speaker: [music]\n[segment 2] Unknown speaker: We are looking at a 5,000 share order.\n[segment 3] Unknown speaker: [music]\n[segment 4] Unknown speaker: Hello and welcome back to another\n[segment 5] Unknown speaker: episode of Google Cloud Live. 
I'm Vlad\n[segment 6] Unknown speaker: Kolesnikov, developer relations engineer\n[segment 7] Unknown speaker: at Google Cloud and today I'm here with\n[segment 8] Unknown speaker: Leonid.\n[segment 9] Unknown speaker: Leonid.\n[segment 10] Unknown speaker: >> Hi. Hello Vlad. Uh thank you. Uh I'm\n[segment 11] Unknown speaker: Leonid. I am a developer advocate at\n[segment 12] Unknown speaker: Google Cloud working with Vlad uh in the\n[segment 13] Unknown speaker: same team. Uh I specialize on uh AI\n[segment 14] Unknown speaker: workload uh security and observability.\n[segment 15] Unknown speaker: But today I will be talking to you about\n[segment 16] Unknown speaker: uh data and agent evaluation.\n[segment 17] Unknown speaker: >> Agent evaluation you know we say from\n[segment 18] Unknown speaker: vibe checks to data-driven evaluations. Uh\n[segment 19] Unknown speaker: what do you think vibe checks are? How\n[segment 20] Unknown speaker: would you describe them?\n[segment 21] Unknown speaker: >> Thank you. Uh and let me switch to the\n[segment 22] Unknown speaker: slides for a moment. uh so when we uh\n[segment 23] Unknown speaker: speak about uh V evaluation\n[segment 24] Unknown speaker: uh or before that uh I want to address\n[segment 25] Unknown speaker: is a question why do we need to evaluate\n[segment 26] Unknown speaker: AI agents in the first place today\n[segment 27] Unknown speaker: perception is that AI is smart enough uh\n[segment 28] Unknown speaker: to do the work and uh so why do we need\n[segment 29] Unknown speaker: uh to test AI uh the main problem uh is\n[segment 30] Unknown speaker: that uh Today people don't believe AI\n[segment 31] Unknown speaker: like they did in the beginning when it\n[segment 32] Unknown speaker: just came. 
Uh I believe that our\n[segment 33] Unknown speaker: audience also don't think that uh AI is\n[segment 34] Unknown speaker: perfect and never makes mistakes.\n[segment 35] Unknown speaker: So uh when we speak about vibe check we\n[segment 36] Unknown speaker: usually speak about uh testings that we\n[segment 37] Unknown speaker: do manually. We run uh our application,\n[segment 38] Unknown speaker: our agent. We ask a few questions uh by\n[segment 39] Unknown speaker: manually typing or feeding the\n[segment 40] Unknown speaker: information to the agent. And uh then\n[segment 41] Unknown speaker: depending what kind of answers we get.\n[segment 42] Unknown speaker: If it is good, we assume that everything\n[segment 43] Unknown speaker: is works and if they are bad, we return\n[segment 44] Unknown speaker: back to the uh code or whatever\n[segment 45] Unknown speaker: configuration of the agent uh we work\n[segment 46] Unknown speaker: with and we try to adjust some uh\n[segment 47] Unknown speaker: settings, configurations, prompts and so\n[segment 48] Unknown speaker: on. So basically, you know, it it's kind\n[segment 49] Unknown speaker: of similar to testing um normal software\n[segment 50] Unknown speaker: and you know, if if you just write a a\n[segment 51] Unknown speaker: regular deterministic software, you can\n[segment 52] Unknown speaker: make a change and then you can just like\n[segment 53] Unknown speaker: run your application or your service,\n[segment 54] Unknown speaker: try a few things like your happy paths,\n[segment 55] Unknown speaker: see that they work and you just like\n[segment 56] Unknown speaker: push it to production. And we know that\n[segment 57] Unknown speaker: it often breaks with AI, I guess, uh at\n[segment 58] Unknown speaker: generative AI these days. 
uh it's even\n[segment 59] Unknown speaker: worse because the AI is like generative\n[segment 60] Unknown speaker: AI is fundamentally not deterministic.\n[segment 61] Unknown speaker: So even when we try something uh and it\n[segment 62] Unknown speaker: works at the moment that may not work uh\n[segment 63] Unknown speaker: next time. So kind of\n[segment 64] Unknown speaker: >> it is true.\n[segment 65] Unknown speaker: >> Yeah. So basically the importance of\n[segment 66] Unknown speaker: testing uh or checking um it increases\n[segment 67] Unknown speaker: and why why don't we just like test it\n[segment 68] Unknown speaker: manually right like like we do with the\n[segment 69] Unknown speaker: with the regular software remember like\n[segment 70] Unknown speaker: in the past we had all those testers and\n[segment 71] Unknown speaker: and then uh we end up um having\n[segment 72] Unknown speaker: automated tests and then uh people\n[segment 73] Unknown speaker: started talking about unit tests um so\n[segment 74] Unknown speaker: that's kind of you know that this is all\n[segment 75] Unknown speaker: applies uh to traditional software What\n[segment 76] Unknown speaker: about AI these days?\n[segment 77] Unknown speaker: >> So, uh I have news for you. We still\n[segment 78] Unknown speaker: have these people.\n[segment 79] Unknown speaker: We still have QA departments that do uh a\n[segment 80] Unknown speaker: lot of uh tests manually like uh running\n[segment 81] Unknown speaker: by themselves. Uh and to be honest, as a\n[segment 82] Unknown speaker: developer advocate, I often do this kind\n[segment 83] Unknown speaker: of uh test by running user journeys on\n[segment 84] Unknown speaker: our products that yet to be released.\n[segment 85] Unknown speaker: However, you are completely right. 
Uh\n[segment 86] Unknown speaker: the main reason that uh manual testing\n[segment 87] Unknown speaker: doesn't work well is actually twofold.\n[segment 88] Unknown speaker: Uh first and uh it is very trivial uh to\n[segment 89] Unknown speaker: test all uh possible scenarios that we\n[segment 90] Unknown speaker: know or expect to work like golden paths\n[segment 91] Unknown speaker: or critical user journeys uh isn't\n[segment 92] Unknown speaker: effective when we speak about uh uh\n[segment 93] Unknown speaker: application or service at scale because\n[segment 94] Unknown speaker: uh there are too many and each time when\n[segment 95] Unknown speaker: we introduce a change or like submit uh\n[segment 96] Unknown speaker: a PR request or something like this\n[segment 97] Unknown speaker: [clears throat] to run it uh by people\n[segment 98] Unknown speaker: it is just very very slow. Another\n[segment 99] Unknown speaker: reason for this is that uh when people\n[segment 100] Unknown speaker: run it uh again I don't know your\n[segment 101] Unknown speaker: experience with QA teams. Uh in my uh\n[segment 102] Unknown speaker: previous uh workplaces QA teams were uh\n[segment 103] Unknown speaker: filing some uh test reports besides the\n[segment 104] Unknown speaker: bug that uh they found describing what\n[segment 105] Unknown speaker: they actually test. They had test plan\n[segment 106] Unknown speaker: documents that they checked. I did this\n[segment 107] Unknown speaker: test. I did that test. Uh but in many\n[segment 108] Unknown speaker: cases still a lot of uh stuff remains\n[segment 109] Unknown speaker: behind the scene and so all this test\n[segment 110] Unknown speaker: sessions uh kind of ephemeral. Uh you\n[segment 111] Unknown speaker: can't like track what actually was done.\n[segment 112] Unknown speaker: Uh you can't uh compare against previous\n[segment 113] Unknown speaker: executions and success. 
And this is why\n[segment 114] Unknown speaker: we actually moved uh from uh uh like uh\n[segment 115] Unknown speaker: focus on manual test by the teams to\n[segment 116] Unknown speaker: more automated approach.\n[segment 117] Unknown speaker: >> I see. I see. So um now if we look at\n[segment 118] Unknown speaker: this um testing software that that we do\n[segment 119] Unknown speaker: or evaluation right uh with AI and\n[segment 120] Unknown speaker: software is similar but with with\n[segment 121] Unknown speaker: generative AI it it's kind of a common\n[segment 122] Unknown speaker: thing and it comes from traditional\n[segment 123] Unknown speaker: machine learning uh we do evaluate stuff\n[segment 124] Unknown speaker: um during development right during the\n[segment 125] Unknown speaker: experimentation stage and uh we we\n[segment 126] Unknown speaker: change our prompts and then we see\n[segment 127] Unknown speaker: whether\n[segment 128] Unknown speaker: you know agents behave better or worse.\n[segment 129] Unknown speaker: Um so now um what's the difference\n[segment 130] Unknown speaker: between this experimentation and really\n[segment 131] Unknown speaker: the the evaluation in uh in the form of\n[segment 132] Unknown speaker: you know where traditional unit tests\n[segment 133] Unknown speaker: are used.\n[segment 134] Unknown speaker: So uh as uh we said in general the\n[segment 135] Unknown speaker: manual testing especially for AI it is\n[segment 136] Unknown speaker: uh much more I would say uh relevant\n[segment 137] Unknown speaker: uh manual testing uh for AI phase when\n[segment 138] Unknown speaker: we try to tweak uh our configuration for\n[segment 139] Unknown speaker: AI we don't actually code AI right we\n[segment 140] Unknown speaker: don't uh modify code for the model in\n[segment 141] Unknown speaker: general\n[segment 142] Unknown speaker: >> we still can we still can right\n[segment 143] Unknown speaker: >> we can but uh when I 
speak about\n[segment 144] Unknown speaker: software developers not data engineers\n[segment 145] Unknown speaker: not AI science scientists\n[segment 146] Unknown speaker: uh we uh adjust some configuration like\n[segment 147] Unknown speaker: uh there is a parameter called\n[segment 148] Unknown speaker: temperature which describes how much we\n[segment 149] Unknown speaker: allow to uh the model to uh experiment\n[segment 150] Unknown speaker: or go to uh some discovery interface uh\n[segment 151] Unknown speaker: and there are other things like uh\n[segment 152] Unknown speaker: system instructions, prompts, uh\n[segment 153] Unknown speaker: additional uh information that we can uh\n[segment 154] Unknown speaker: plug to uh the system or some logic uh\n[segment 155] Unknown speaker: provided with the agent configuration.\n[segment 156] Unknown speaker: uh so this is kind of uh the phase that\n[segment 157] Unknown speaker: we uh can call discovery when we uh\n[segment 158] Unknown speaker: experiment and uh it is hard to uh\n[segment 159] Unknown speaker: algorithmize\n[segment 160] Unknown speaker: uh experiments. So it is naturally uh\n[segment 161] Unknown speaker: have a creative and open-ending uh natur\n[segment 162] Unknown speaker: uh approach which uh makes sense to do\n[segment 163] Unknown speaker: manually uh to have a a person uh doing\n[segment 164] Unknown speaker: it versus the agent. 
uh on another side\n[segment 165] Unknown speaker: we have uh our CI/CD process that have\n[segment 166] Unknown speaker: to be rigid and repetitive and this\n[segment 167] Unknown speaker: process\n[segment 168] Unknown speaker: >> yeah ju just for for the audience like\n[segment 169] Unknown speaker: when we say CI/CD we mean like\n[segment 170] Unknown speaker: continuous integration\n[segment 171] Unknown speaker: >> and continuous deployment as far as I\n[segment 172] Unknown speaker: remember right\n[segment 173] Unknown speaker: >> and this is where\n[segment 174] Unknown speaker: >> yeah this is where effectively our uh\n[segment 175] Unknown speaker: software in this case our agent is sort\n[segment 176] Unknown speaker: of like done or almost done right in and\n[segment 177] Unknown speaker: it's in the process of final tweaks or\n[segment 178] Unknown speaker: it's in the process of being uh\n[segment 179] Unknown speaker: continuously improved because the you\n[segment 180] Unknown speaker: know this is what we do we continuously\n[segment 181] Unknown speaker: improve our software we test new models\n[segment 182] Unknown speaker: we do little tweaks uh with our prompts\n[segment 183] Unknown speaker: uh or systems instruction etc etc right\n[segment 184] Unknown speaker: so I\n[segment 185] Unknown speaker: >> I would say yes and no actually because\n[segment 186] Unknown speaker: uh when we say agent uh today uh this uh\n[segment 187] Unknown speaker: has certain ambiguity because the engine\n[segment 188] Unknown speaker: includes some interaction with the\n[segment 189] Unknown speaker: model. I can simply change the model I\n[segment 190] Unknown speaker: use uh and start using a different model\n[segment 191] Unknown speaker: which can completely change uh the\n[segment 192] Unknown speaker: behavior of the overall application. 
But\n[segment 193] Unknown speaker: I can also go and uh add uh some uh\n[segment 194] Unknown speaker: tools uh some instrument in instrument\n[segment 195] Unknown speaker: that model can invoke through uh some uh\n[segment 196] Unknown speaker: thinking process. Uh the model can uh be\n[segment 197] Unknown speaker: made aware by the agent that it can uh\n[segment 198] Unknown speaker: use some tools and these tools\n[segment 199] Unknown speaker: essentially implemented as uh simple","segments":[{"id":"b4432dc6-f68c-41c0-b7ac-37f443fc2ccc","segment_index":0,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"[music]"},{"id":"9ebe4eb9-22f4-4bd1-803f-5d580601ed78","segment_index":1,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"[music]"},{"id":"e03d98f7-5911-4433-a91c-9a4d461e50bd","segment_index":2,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"We are looking at a 5,000 share order."},{"id":"83de807e-9e98-4d17-abdc-1043dd8bea87","segment_index":3,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"[music]"},{"id":"4cd7a733-30e7-4645-8767-cd72bc45e3c0","segment_index":4,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Hello and welcome back to another"},{"id":"f929527d-c295-43c8-914a-691d712ff6f9","segment_index":5,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"episode of Google Cloud Live. 
I'm Vlad"},{"id":"9a884ee1-bf25-4b9c-85b9-469084b83c57","segment_index":6,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Kolesnikov, developer relations engineer"},{"id":"c9f83d15-598d-4dcb-8282-5385b7177efc","segment_index":7,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"at Google Cloud and today I'm here with"},{"id":"b66496e0-b5ac-40de-9138-b101d3726abe","segment_index":8,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Leonid."},{"id":"6a88d2fe-cccf-48bd-897f-f8245435c475","segment_index":9,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Leonid."},{"id":"69ecee5d-bb2a-4fc3-a001-22d252b67151","segment_index":10,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> Hi. Hello Vlad. Uh thank you. Uh I'm"},{"id":"ce3e3a63-3814-48a2-8677-a288c1c008c9","segment_index":11,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Leonid. I am a developer advocate at"},{"id":"9dffe3e5-3985-4c9c-9b08-c35bfca37709","segment_index":12,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Google Cloud working with Vlad uh in the"},{"id":"14249871-9b62-4c6c-9e67-057e81e444bd","segment_index":13,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"same team. 
Uh I specialize on uh AI"},{"id":"03cacbb3-483b-46df-8066-06b38928c7ca","segment_index":14,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"workload uh security and observability."},{"id":"513c32ae-7520-40ee-b423-6b282eefedf0","segment_index":15,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"But today I will be talking to you about"},{"id":"a09cba43-9f9d-4a46-bcd6-ea110567edc7","segment_index":16,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh data and agent evaluation."},{"id":"fa6dffc9-217f-4da6-bd87-fc5ccb7e0e8c","segment_index":17,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> Agent evaluation you know we say from"},{"id":"06855ead-a00f-4998-8549-66e93a44843c","segment_index":18,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"vibe checks to data-driven evaluations. Uh"},{"id":"a2e14e24-b90a-452b-bb95-68920e39fac1","segment_index":19,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"what do you think vibe checks are? How"},{"id":"8662d7d1-ff33-4b87-a7a5-75b1c13e1597","segment_index":20,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"would you describe them?"},{"id":"3ff1df52-7171-4cc6-8c05-76ec8f7da862","segment_index":21,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> Thank you. Uh and let me switch to the"},{"id":"8613eabd-1a48-4a0b-afc6-9c5cfee328e2","segment_index":22,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"slides for a moment. 
uh so when we uh"},{"id":"bda70349-ba7c-47d9-9584-53af1c4e7786","segment_index":23,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"speak about uh V evaluation"},{"id":"c252482d-6d4b-4337-b1a1-8413bd4ae5d9","segment_index":24,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh or before that uh I want to address"},{"id":"efdc4d31-1a26-4270-aa87-b67e6a77acb2","segment_index":25,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"is a question why do we need to evaluate"},{"id":"c822c0eb-13a6-46ea-bf22-877a75363ce2","segment_index":26,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"AI agents in the first place today"},{"id":"fd30ef16-fc6c-4626-85d8-ae931dee4ceb","segment_index":27,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"perception is that AI is smart enough uh"},{"id":"f577555e-1305-4a72-9ede-ebb6e0921c23","segment_index":28,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"to do the work and uh so why do we need"},{"id":"71ef2f54-fd16-4670-88d5-d96f52b46130","segment_index":29,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh to test AI uh the main problem uh is"},{"id":"5ab1cf77-6cce-49bc-b48e-7ac3d6e1bc82","segment_index":30,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"that uh Today people don't believe AI"},{"id":"f3706641-8498-4d73-810e-3500a85db071","segment_index":31,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"like they did in the beginning when it"},{"id":"1503a8d2-05e9-48a7-8dfc-569481082b11","segment_index":32,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"just came. 
Uh I believe that our"},{"id":"a06a28ea-29a7-4b34-97fe-43398b1cdf2f","segment_index":33,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"audience also don't think that uh AI is"},{"id":"c0dc05f3-3393-4d2a-92d0-b47d5221735d","segment_index":34,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"perfect and never makes mistakes."},{"id":"b80012e7-6cdd-4674-b6b6-b6bc844ffa49","segment_index":35,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"So uh when we speak about vibe check we"},{"id":"d608f5ce-d13e-4dc8-abb0-c9c234038b81","segment_index":36,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"usually speak about uh testings that we"},{"id":"6c714a81-857e-4bf0-bf21-6fac69d86e2b","segment_index":37,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"do manually. We run uh our application,"},{"id":"b2a5bc2f-e61e-472a-b584-7cf75ce78b13","segment_index":38,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"our agent. We ask a few questions uh by"},{"id":"7e0bd545-fddf-4a7b-85ec-3253c0f11826","segment_index":39,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"manually typing or feeding the"},{"id":"f118046c-42c2-4cdc-9c8f-70f7eaa686be","segment_index":40,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"information to the agent. 
And uh then"},{"id":"9f5302f3-d176-46ee-85cb-557ac27d8960","segment_index":41,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"depending what kind of answers we get."},{"id":"2a58b5a9-804e-4af3-aa1b-30303a1a8b2b","segment_index":42,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"If it is good, we assume that everything"},{"id":"3a8a42a9-0a7f-4785-841b-fc73dd7a50ed","segment_index":43,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"is works and if they are bad, we return"},{"id":"8268cdca-d32e-4eb0-b600-e4959d06f09f","segment_index":44,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"back to the uh code or whatever"},{"id":"27c283c7-422a-4946-a8c5-0058ad921d21","segment_index":45,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"configuration of the agent uh we work"},{"id":"1f123926-1022-4e25-bb2d-493526246a1a","segment_index":46,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"with and we try to adjust some uh"},{"id":"bc952b5e-0879-4ed3-97dc-6587a37d36b2","segment_index":47,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"settings, configurations, prompts and so"},{"id":"73db69f1-f6b7-4210-9c32-116a8cd78fee","segment_index":48,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"on. 
So basically, you know, it it's kind"},{"id":"d5462128-f2c6-4d0d-b523-179c24b88c0d","segment_index":49,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"of similar to testing um normal software"},{"id":"7fddd214-9cc6-4d57-a57f-d0ce96a44ca8","segment_index":50,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"and you know, if if you just write a a"},{"id":"4bb9029a-dc8c-4537-9c7b-b74e6a5227ad","segment_index":51,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"regular deterministic software, you can"},{"id":"b5c95eb9-5785-4a13-b367-70d17fa0aaba","segment_index":52,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"make a change and then you can just like"},{"id":"5360a8d4-cde8-4309-b595-16a9098f474b","segment_index":53,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"run your application or your service,"},{"id":"2e74320e-1389-42e0-83f7-335b93cecfe4","segment_index":54,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"try a few things like your happy paths,"},{"id":"11dd478d-c3f2-4c97-b93e-b580244b5e89","segment_index":55,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"see that they work and you just like"},{"id":"6c77d1f6-d2c2-45e3-8b28-b6b88c5840c0","segment_index":56,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"push it to production. And we know that"},{"id":"45a99fad-5b03-42fe-9d24-1b8295a9bb6d","segment_index":57,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"it often breaks with AI, I guess, uh at"},{"id":"a00598a7-50f9-483b-b305-a05990425625","segment_index":58,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"generative AI these days. 
uh it's even"},{"id":"fb647904-963b-4fd1-af8e-6b3bab51efdb","segment_index":59,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"worse because the AI is like generative"},{"id":"2e2c5b17-933a-49aa-a602-6a718c6b985b","segment_index":60,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"AI is fundamentally not deterministic."},{"id":"14437f2b-33c4-4597-b7f5-c26d92178893","segment_index":61,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"So even when we try something uh and it"},{"id":"e16db46b-03fe-4f3f-8ceb-86a53723b172","segment_index":62,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"works at the moment that may not work uh"},{"id":"cdacfcac-96c3-48d2-9e91-1b7396d28ca5","segment_index":63,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"next time. So kind of"},{"id":"e92d3f47-01ce-49f8-922a-2f8defd040d2","segment_index":64,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> it is true."},{"id":"2a194155-bfe4-4baf-b67d-b541c0d5e385","segment_index":65,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> Yeah. 
So basically the importance of"},{"id":"afc8b476-9d13-440a-84da-c98e4149384f","segment_index":66,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"testing uh or checking um it increases"},{"id":"e78841cd-4850-4125-95c4-2e77a3543d8c","segment_index":67,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"and why why don't we just like test it"},{"id":"8efd2249-62b3-48f9-9f49-3ad008d7307f","segment_index":68,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"manually right like like we do with the"},{"id":"d5153b7f-1450-4d8b-a6df-b7d22c81b931","segment_index":69,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"with the regular software remember like"},{"id":"1d5cc4af-a14c-4a11-a900-d4c8f1746f3f","segment_index":70,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"in the past we had all those testers and"},{"id":"cfbe46b6-796a-48c0-9bbc-69a2d015adb8","segment_index":71,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"and then uh we end up um having"},{"id":"2daa1339-8895-451d-ae00-b004f01e953f","segment_index":72,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"automated tests and then uh people"},{"id":"c439bb3c-f4a6-47b6-b49c-66ad229593b0","segment_index":73,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"started talking about unit tests um so"},{"id":"30ee077e-9ae9-4b2b-a9ae-4a9429633c30","segment_index":74,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"that's kind of you know that this is all"},{"id":"59aa0c13-f93b-4ac9-9636-e5dd9613a106","segment_index":75,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"applies u to traditional software What"},{"id":"a6631fda-4144-4b20-9b1e-70382aa9290d","segment_index":76,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"about AI these 
days?"},{"id":"d2d39e61-2e70-48b7-8839-265bc8dfba7e","segment_index":77,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> So, uh I have news for you. We still"},{"id":"7b593537-9ad3-408f-a428-896ec9db7953","segment_index":78,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"have these people."},{"id":"0731cc47-ab51-4a4a-aba9-4d6837f848b9","segment_index":79,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"We still have QA departments that do uh a"},{"id":"2bd29d4e-8892-4ce9-ab9e-02baea0693e3","segment_index":80,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"lot of uh tests manually like uh running"},{"id":"5b8bbd3a-2f81-4556-bb8e-2c4bdf181df3","segment_index":81,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"by themselves. Uh and to be honest, as a"},{"id":"1cfb0eb2-9fb0-4a94-863c-67ba6e95070e","segment_index":82,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"developer advocate, I often do this kind"},{"id":"b248e552-a00a-45f5-871a-b366d5339579","segment_index":83,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"of uh test by running user journeys on"},{"id":"d362a691-bdbd-4ade-8291-7dbe0a2840b9","segment_index":84,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"our products that yet to be released."},{"id":"c79834f3-4eea-499f-9402-2c9a965d2817","segment_index":85,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"However, you are completely right. 
Uh"},{"id":"cadeeae8-656d-498e-b394-b9ca4a9d7e16","segment_index":86,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"the main reason that uh manual testing"},{"id":"768f7519-6e24-49a4-9387-3efd3d8ea6c5","segment_index":87,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"doesn't work well is actually twofold."},{"id":"e420a310-2704-4ad5-b683-49c0755f2315","segment_index":88,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Uh first and uh it is very trivial uh to"},{"id":"a736ea27-6870-43a1-8b14-d186f3b69d21","segment_index":89,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"test all uh possible scenarios that we"},{"id":"0efb0d3f-ab83-4c33-9286-85d75efdf68b","segment_index":90,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"know or expect to work like golden paths"},{"id":"fc147f97-21f0-4603-9aff-9ff7fc0fdc10","segment_index":91,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"or critical user journeys uh isn't"},{"id":"1a4519cc-a8b4-4b18-bdfb-f35415fcc6e3","segment_index":92,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"effective when we speak about uh uh"},{"id":"2869e639-d9b6-4b67-ace1-fcd1c614c690","segment_index":93,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"application or service at scale because"},{"id":"b217b66c-5813-466e-b05c-9aaaa4e120c5","segment_index":94,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh there are too many and each time when"},{"id":"b02cf869-42ff-4548-820a-df6b5322274a","segment_index":95,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"we introduce a change or like submit uh"},{"id":"7311c3fa-d914-4330-bdeb-d29e26852058","segment_index":96,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"a PR request or something like 
this"},{"id":"ab30c9ec-42ba-4dd7-97e1-329d113e22b8","segment_index":97,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"[clears throat] to run it uh by people"},{"id":"10c4ff4d-beea-43f4-9039-fad10acad0c1","segment_index":98,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"it is just very very slow. Another"},{"id":"936ac8a3-9c54-4304-a2d9-de55926acae5","segment_index":99,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"reason for this is that uh when people"},{"id":"e37ca975-56f9-40cd-a9a8-b931c7f701e4","segment_index":100,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"run it uh again I don't know your"},{"id":"b7d9e46b-ee03-4a25-8d78-c8d136bc00c6","segment_index":101,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"experience with QA teams. Uh in my uh"},{"id":"eb8ad0d0-1144-4406-8d42-d971431c4447","segment_index":102,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"previous uh workplaces QA teams were uh"},{"id":"d90871c6-3591-40e6-acd4-8d4ecc73a76f","segment_index":103,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"filing some uh test reports besides the"},{"id":"2cbdfabb-803e-4287-ad1c-6791c1ec1bc6","segment_index":104,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"bug that uh they found describing what"},{"id":"f42afea0-a66d-47f2-b2a8-f78c01d22f5e","segment_index":105,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"they actually test. They had test plan"},{"id":"014b2e76-ef4e-4b0d-9e6b-ecefbfcfc721","segment_index":106,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"documents that they checked. I did this"},{"id":"d3583461-206a-4aad-9f85-7765dea3c1f1","segment_index":107,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"test. I did that test. 
Uh but in many"},{"id":"7945f476-b9d6-45c3-8fea-b64e0924e1ab","segment_index":108,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"cases still a lot of uh stuff remains"},{"id":"22a6000c-e60c-415a-9925-882f87b4c52d","segment_index":109,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"behind the scene and so all this test"},{"id":"ac79ef58-7b71-4869-8d3b-187a8f6b3b07","segment_index":110,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"sessions uh kind of ephemeral. Uh you"},{"id":"5b0fd7a2-0254-433b-a964-96df89ddbac8","segment_index":111,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"can't like track what actually was done."},{"id":"2d078e0e-eac2-4c07-9c54-cc27d1c4e5d8","segment_index":112,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Uh you can't uh compare against previous"},{"id":"fad882fe-03ff-4d86-a782-2854a568ead0","segment_index":113,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"executions and success. And this is why"},{"id":"25244ea9-0ae7-4574-bdbe-aea5cc2537a5","segment_index":114,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"we actually moved uh from uh uh like uh"},{"id":"ff29c22c-2d0c-4a37-8d74-c3151987c87b","segment_index":115,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"focus on manual test by the teams to"},{"id":"5c7f5224-a34e-4351-b671-6c5043d8a8b3","segment_index":116,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"more automated approach."},{"id":"03d97075-fa36-42eb-a976-14271c5191e6","segment_index":117,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> I see. I see. 
So um now if we look at"},{"id":"3df91b37-2565-4bb8-b493-c68b5ca40c9b","segment_index":118,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"this um testing software that that we do"},{"id":"811aacf7-f6e1-4c62-a24b-2fa6d67406ad","segment_index":119,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"or evaluation right uh with AI and"},{"id":"fa2efc20-a2f6-4046-a717-27fb2fb4a854","segment_index":120,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"software is similar but with with"},{"id":"e7e25fc1-db15-4ead-8187-f8eb844d12ef","segment_index":121,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"generative AI it it's kind of a common"},{"id":"c18bb46c-3e02-42be-8c87-d463ca259979","segment_index":122,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"thing and it comes from traditional"},{"id":"7686de9d-23b9-4db8-9ae3-3de4c0e98e99","segment_index":123,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"machine learning uh we do evaluate stuff"},{"id":"dd40e031-bdd1-4a14-a7b2-6f2d49e30a1f","segment_index":124,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"um during development right during the"},{"id":"75d11bf5-b180-457d-951d-1a464ede1ed3","segment_index":125,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"experimentation stage and uh we we"},{"id":"404feadc-fdd5-42b4-ad5d-bb6369b61b71","segment_index":126,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"change our prompts and then we see"},{"id":"0cb69c6b-3225-4e74-97ca-c4124b99cd68","segment_index":127,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"whether"},{"id":"dca1f74d-7dd0-45b8-a363-0414796a2a6b","segment_index":128,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"you know agents behave better or 
worse."},{"id":"fcbca905-871f-465a-8f67-6db961cac877","segment_index":129,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"Um so now um what's the difference"},{"id":"b72c7ece-512c-4254-b701-0ce011b679a5","segment_index":130,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"between this experimentation and really"},{"id":"03386f64-5dac-4f1b-9a89-fdc18e12e75e","segment_index":131,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"the the evaluation in uh in the form of"},{"id":"2017bc2c-cd9f-436d-a8f9-e2a4b62132ac","segment_index":132,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"you know where traditional unit tests"},{"id":"e266b99a-66ff-456a-904a-4a4c7f512fe6","segment_index":133,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"are used."},{"id":"951a2a89-2fe8-4e13-bd21-c89770809558","segment_index":134,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"So uh as uh we said in general the"},{"id":"ac2deeed-ae11-4ea6-b3e5-118b578abcc1","segment_index":135,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"manual testing especially for AI it is"},{"id":"608d15da-614f-418a-8784-d6f1e89ca4a2","segment_index":136,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh much more I would say uh relevant"},{"id":"61ab7193-124f-453e-91f9-183179ddc55f","segment_index":137,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh manual testing uh for AI phase when"},{"id":"1942a81c-f236-4dc4-b333-91998781c88b","segment_index":138,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"we try to tweak uh our configuration for"},{"id":"6a0ec41b-77a0-4e2a-a5fe-a0dfe2fa08ab","segment_index":139,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"AI we don't actually code AI right 
we"},{"id":"5d83e588-d071-40e0-8314-b90f687d526c","segment_index":140,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"don't uh modify code for the model in"},{"id":"48d4f0d0-343e-4b17-b510-2951ea5d0b11","segment_index":141,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"general"},{"id":"5a27e266-3664-4c01-bb2f-f6a2818256b3","segment_index":142,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> we still can we still can right"},{"id":"792c3640-9ad5-4093-8df7-b3009ed22c7f","segment_index":143,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> we can but uh when I speak about"},{"id":"82091c71-e8ed-4273-bf6a-334397bd22e7","segment_index":144,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"software developers not data engineers"},{"id":"698deba4-096c-48ca-8774-4403476e5a9d","segment_index":145,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"not AI science scientists"},{"id":"2079a555-d84a-4f53-8312-5c9883640aa7","segment_index":146,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh we uh adjust some configuration like"},{"id":"cc4b871e-bb30-4e4f-bb7a-0c813653cc88","segment_index":147,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh there is a parameter called"},{"id":"d72a8d8c-d1e8-4080-b229-892f94c519b5","segment_index":148,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"temperature which describes how much we"},{"id":"ed04aa6d-c4d8-47f2-904d-4a47bc6c1be0","segment_index":149,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"allow to uh the model to uh experiment"},{"id":"fae400bc-7ca7-4aa8-b8ed-1e582d6dd951","segment_index":150,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"or go to uh some discovery interface uh"},{"id":"5595978a-c176-42a4-a3b8-6952d93dab9f","segment_index":151,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"and 
there are other things like uh"},{"id":"8a91c895-d148-4ee0-aa4a-a3c6052251f6","segment_index":152,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"system instructions, prompts, uh"},{"id":"6e55daee-9882-4736-b3fe-f941613abdda","segment_index":153,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"additional uh information that we can uh"},{"id":"85f5012d-35dd-49d9-8815-67677171fa85","segment_index":154,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"plug to uh the system or some logic uh"},{"id":"63a377da-c80b-45cd-aa73-b4e91795bb7f","segment_index":155,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"provided with the agent configuration."},{"id":"19bf40f6-9162-40de-94c3-53bf32e81c96","segment_index":156,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh so this is kind of uh the phase that"},{"id":"59b851c8-4bc0-446c-9f4b-40511a3b0be0","segment_index":157,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"we uh can call discovery when we uh"},{"id":"44a41522-4848-48e6-90e5-ea799fced6d3","segment_index":158,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"experiment and uh it is hard to uh"},{"id":"f636def5-aad8-49c0-8386-37037ed88b4b","segment_index":159,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"algorithmize"},{"id":"e972e4f1-aee7-45d8-b9b4-82a56d4722ed","segment_index":160,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh experiments. 
So it is naturally uh"},{"id":"92b0fa82-fcad-4d1a-9aa9-085c9739c721","segment_index":161,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"have a creative and open-ending uh natur"},{"id":"025cc2fe-1559-4e7d-8dd5-8ec588bd27df","segment_index":162,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh approach which uh makes sense to do"},{"id":"e4013c89-3af7-48f1-a1e3-d1439d687277","segment_index":163,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"manually uh to have a a person uh doing"},{"id":"3d8eebe0-da2b-4b0b-8fc4-6f37219b6932","segment_index":164,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"it versus the agent. uh on another side"},{"id":"7203da01-01ec-417a-bf19-ffcd5753e2b2","segment_index":165,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"we have uh our CI/CD process that have"},{"id":"56d46a71-9af8-443d-bfe3-6eae78aeb309","segment_index":166,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"to be rigid and repetitive and this"},{"id":"fe2620d9-1ed6-44fb-98a9-89122a1e2ebd","segment_index":167,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"process"},{"id":"19508104-5e37-4c11-995d-d10f686e0f66","segment_index":168,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> yeah ju just for for the audience like"},{"id":"9d8901e6-7cc9-4238-91a9-69cb648c1d8d","segment_index":169,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"when we say CI/CD we mean like"},{"id":"2d98b355-4048-4136-8352-90cfc3f0a065","segment_index":170,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"continuous integration"},{"id":"f3a3397f-e4c3-4e74-b588-25c9f486bd3f","segment_index":171,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> and continuous deployment as far as 
I"},{"id":"a437cc48-f22b-4064-9936-d4df23c35a58","segment_index":172,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"remember right"},{"id":"e3433f90-694b-40b9-8726-9cc0a67da06f","segment_index":173,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> and this is where"},{"id":"5c21b6aa-758b-4f7b-aa8e-daeb24a5ef7a","segment_index":174,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> yeah this is where effectively our uh"},{"id":"2e7e0366-0cb1-455e-a5a0-fc6373d6d40e","segment_index":175,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"software in this case our agent is sort"},{"id":"f7fccf12-6dce-4a69-9b95-891438bcd251","segment_index":176,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"of like done or almost done right in and"},{"id":"894fa08e-957b-4c82-a8b5-59095b7782e1","segment_index":177,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"it's in the process of final tweaks or"},{"id":"744e6568-bffb-4a54-811a-7b3c756ad248","segment_index":178,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"it's in the process of being uh"},{"id":"0d5a7136-61d6-42f8-a765-84e3cae6b3e4","segment_index":179,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"continuously improved because the you"},{"id":"b8395a11-53dc-425e-96db-7d33cfe1a6bc","segment_index":180,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"know this is what we do we continuously"},{"id":"40b8aec9-9418-4b73-aae2-11e302331169","segment_index":181,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"improve our software we test new models"},{"id":"2070fa6e-6b85-4276-b9e7-d74cc79f24af","segment_index":182,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"we do little tweaks uh with our 
prompts"},{"id":"5dd7ddfb-afb5-44ae-9976-ea9b2fa04325","segment_index":183,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh or systems instruction etc etc right"},{"id":"f07cf83f-3852-447b-8bd4-c233ab890cd9","segment_index":184,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"so I"},{"id":"b05cc37e-78f6-4d7b-89fc-4adf96141b1a","segment_index":185,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":">> I would say yes and no actually because"},{"id":"de8d9705-65f9-4193-b5f6-ce0642d65645","segment_index":186,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"uh when we say agent uh today uh this uh"},{"id":"b70c59e7-b80e-4bef-beb3-d20e4a4f165f","segment_index":187,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"has certain ambiguity because the agent"},{"id":"fcc3d62d-6f38-458a-a759-f3f8d4aa9943","segment_index":188,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"includes some interaction with the"},{"id":"7a9cf30c-6bea-4014-a61d-58b3bc503241","segment_index":189,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"model. I can simply change the model I"},{"id":"9eceddd5-c192-42bf-ac0e-a90144f6298d","segment_index":190,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"use uh and start using a different model"},{"id":"cbc7505d-5eef-4943-bfaf-66fb206cf691","segment_index":191,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"which can completely change uh the"},{"id":"c5d3a969-ccab-45b0-b2cb-45eb9ce488cd","segment_index":192,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"behavior of the overall application. 
But"},{"id":"37530f16-5327-4df1-83e0-ae6f50604e9d","segment_index":193,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"I can also go and uh add uh some uh"},{"id":"f8b1698f-4fbf-43f7-9b17-878f2a37ea06","segment_index":194,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"tools uh some instrument in instrument"},{"id":"d50e9111-910a-4ed4-b410-96f6def38aee","segment_index":195,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"that model can invoke through uh some uh"},{"id":"b98f7ae5-98c1-4168-9a11-9d31d647015c","segment_index":196,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"thinking process. Uh the model can uh be"},{"id":"4e077a35-f4cf-4a6c-97cd-22fc46e2a446","segment_index":197,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"made aware by the agent that it can uh"},{"id":"bad55562-5b34-4b47-adf2-f469e25108e3","segment_index":198,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"use some tools and these tools"},{"id":"5c7f7748-24cd-4b19-aaa9-3745c91747ac","segment_index":199,"speaker_name":null,"start_seconds":null,"end_seconds":null,"text":"essentially implemented as uh simple"}]},"content_assets":[]}