{"id":12782,"date":"2024-06-10T10:39:06","date_gmt":"2024-06-10T10:39:06","guid":{"rendered":"https:\/\/dailyai.com\/?p=12782"},"modified":"2024-06-10T10:39:06","modified_gmt":"2024-06-10T10:39:06","slug":"natural-plan-benchmarking-llms-on-natural-language-planning","status":"publish","type":"post","link":"https:\/\/dailyai.com\/da\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/","title":{"rendered":"NATURAL PLAN: Benchmarking LLMs on natural language planning"},"content":{"rendered":"<p><strong>Google DeepMind researchers developed NATURAL PLAN, a benchmark for evaluating the capability of LLMs to plan real-world tasks based on natural language prompts.<\/strong><\/p>\n<p>The next evolution of AI is to have it leave the confines of a chat platform and take on agentic roles to complete tasks across platforms on our behalf. But that's harder than it sounds.<\/p>\n<p>Planning tasks like scheduling a meeting or compiling a holiday itinerary might seem simple for us. Humans are good at reasoning through multiple steps and predicting whether a course of action will accomplish the desired objective or not.<\/p>\n<p>You might find that easy, but even the best AI models struggle with planning. 
Could we benchmark them to see which LLM is best at planning?<\/p>\n<p>The NATURAL PLAN benchmark tests LLMs on 3 planning tasks:<\/p>\n<ul>\n<li><strong>Trip Planning<\/strong> - Planning a trip itinerary under flight and destination constraints<\/li>\n<li><strong>Meeting Planning<\/strong> - Scheduling meetings with multiple friends in different locations<\/li>\n<li><strong>Calendar Scheduling<\/strong> - Scheduling work meetings between multiple people given existing schedules and various constraints<\/li>\n<\/ul>\n<p>The experiment began with few-shot prompting, in which the models were given 5 example prompts with corresponding correct answers. They were then asked to solve planning tasks of varying difficulty.<\/p>\n<p>Here's a prompt and solution that was given to the models as an example:<\/p>\n<figure id=\"attachment_12784\" aria-describedby=\"caption-attachment-12784\" style=\"width: 1342px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-12784 size-full\" src=\"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-Prompt-example.png\" alt=\"\" width=\"1342\" height=\"808\" srcset=\"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-Prompt-example.png 1342w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-Prompt-example-300x181.png 300w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-Prompt-example-1024x617.png 1024w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-Prompt-example-768x462.png 768w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-Prompt-example-18x12.png 18w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-Prompt-example-60x36.png 60w\" sizes=\"auto, (max-width: 1342px) 100vw, 1342px\" 
\/><figcaption id=\"caption-attachment-12784\" class=\"wp-caption-text\">An example of a prompt and solution used in the Trip Planning experiment. Source: arXiv<\/figcaption><\/figure>\n<h2>Results<\/h2>\n<p>The researchers tested GPT-3.5, GPT-4, <a href=\"https:\/\/dailyai.com\/da\/2024\/05\/everything-you-need-to-know-about-openais-new-flagship-model-gpt-4o\/\">GPT-4o<\/a>, Gemini 1.5 Flash, and <a href=\"https:\/\/dailyai.com\/da\/2024\/02\/google-plays-another-ai-card-in-the-form-of-gemini-1-5-pro\/\"><span class=\"noTranslate\" data-no-translation=\"\">Gemini<\/span> 1.5 Pro<\/a>. None of them performed very well on these tests.<\/p>\n<p>The results must have gone down well in the DeepMind office though, as Gemini 1.5 Pro came out on top.<\/p>\n<figure id=\"attachment_12785\" aria-describedby=\"caption-attachment-12785\" style=\"width: 1302px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-12785 size-full\" src=\"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-results.png\" alt=\"\" width=\"1302\" height=\"204\" srcset=\"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-results.png 1302w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-results-300x47.png 300w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-results-1024x160.png 1024w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-results-768x120.png 768w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-results-18x3.png 18w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLAN-results-60x9.png 60w\" sizes=\"auto, (max-width: 1302px) 100vw, 1302px\" \/><figcaption id=\"caption-attachment-12785\" class=\"wp-caption-text\">NATURAL PLAN benchmark results. 
Source: arXiv<\/figcaption><\/figure>\n<p>As expected, the results got exponentially worse with more complex prompts, where the number of people or cities was increased. For example, look at how quickly the accuracy fell as more people were added to the Meeting Planning test.<\/p>\n<figure id=\"attachment_12786\" aria-describedby=\"caption-attachment-12786\" style=\"width: 1330px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-12786 size-full\" src=\"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLANNING-results-vs-complexity.png\" alt=\"\" width=\"1330\" height=\"530\" srcset=\"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLANNING-results-vs-complexity.png 1330w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLANNING-results-vs-complexity-300x120.png 300w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLANNING-results-vs-complexity-1024x408.png 1024w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLANNING-results-vs-complexity-768x306.png 768w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLANNING-results-vs-complexity-18x7.png 18w, https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/NATURAL-PLANNING-results-vs-complexity-60x24.png 60w\" sizes=\"auto, (max-width: 1330px) 100vw, 1330px\" \/><figcaption id=\"caption-attachment-12786\" class=\"wp-caption-text\">The accuracy of results in the Meeting Planning test degraded exponentially as the prompts became more complex. Source: arXiv<\/figcaption><\/figure>\n<p>Could multi-shot prompting result in improved accuracy? 
The results of the research indicate that it can, but only if the model has a large enough context window.<\/p>\n<p>Gemini 1.5 Pro's larger context window enables it to leverage more in-context examples than the GPT models.<\/p>\n<p>The researchers found that in Trip Planning, increasing the number of shots from 1 to 800 improves the accuracy of Gemini 1.5 Pro from 2.7% to 39.9%.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/pdf\/2406.04520\" target=\"_blank\" rel=\"noopener\">The paper<\/a> states, \"These results show the promise of in-context planning where the long-context capabilities enable LLMs to leverage further context to improve planning.\"<\/p>\n<p>A strange result was that GPT-4o was really bad at Trip Planning. The researchers found that it struggled \"to understand and respect the flight connectivity and travel date constraints.\"<\/p>\n<p>Another strange outcome was that self-correction led to a significant drop in performance across all models. 
When the models were prompted to check their work and make corrections, they made more mistakes.<\/p>\n<p>Interestingly, the stronger models, such as GPT-4 and Gemini 1.5 Pro, suffered bigger losses than GPT-3.5 when self-correcting.<\/p>\n<p>Agentic AI is an exciting prospect, and we're already seeing some practical use cases in <a href=\"https:\/\/dailyai.com\/da\/2024\/05\/ai-agents-multimodal-phi-3-unveiled-at-microsoft-build-2024\/\">Microsoft <span class=\"noTranslate\" data-no-translation=\"\">Copilot<\/span> agents<\/a>.<\/p>\n<p>But the results of the NATURAL PLAN benchmark tests show that there is some way to go before AI can handle more complex planning.<\/p>\n<p>The DeepMind researchers concluded that \"NATURAL PLAN is very hard for state-of-the-art models to solve.\"<\/p>\n<p>It seems AI won't be replacing travel agents and personal assistants quite yet.<\/p>","protected":false},"excerpt":{"rendered":"<p>Google DeepMind researchers developed NATURAL PLAN, a benchmark for evaluating the capability of LLMs to plan real-world tasks based on natural language prompts. The next evolution of AI is to have it leave the confines of a chat platform and take on agentic roles to complete tasks across platforms on our behalf. But that's harder than it sounds. Planning tasks like scheduling a meeting or compiling a holiday itinerary might seem simple for us. Humans are good at reasoning through multiple steps and predicting whether a course of action will accomplish the desired objective or not. 
You might find that<\/p>","protected":false},"author":6,"featured_media":12787,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[84],"tags":[147,118],"class_list":["post-12782","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-industry","tag-deepmind","tag-llms"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.3 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>NATURAL PLAN: Benchmarking LLMs on natural language planning | DailyAI<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/dailyai.com\/da\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/\" \/>\n<meta property=\"og:locale\" content=\"da_DK\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"NATURAL PLAN: Benchmarking LLMs on natural language planning | DailyAI\" \/>\n<meta property=\"og:description\" content=\"Google DeepMind researchers developed NATURAL PLAN, a benchmark for evaluating the capability of LLMs to plan real-world tasks based on natural language prompts. The next evolution of AI is to have it leave the confines of a chat platform and take on agentic roles to complete tasks across platforms on our behalf. But that\u2019s harder than it sounds. Planning tasks like scheduling a meeting or compiling a holiday itinerary might seem simple for us. Humans are good at reasoning through multiple steps and predicting whether a course of action will accomplish the desired objective or not. 
You might find that\" \/>\n<meta property=\"og:url\" content=\"https:\/\/dailyai.com\/da\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/\" \/>\n<meta property=\"og:site_name\" content=\"DailyAI\" \/>\n<meta property=\"article:published_time\" content=\"2024-06-10T10:39:06+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/Planning.webp\" \/>\n\t<meta property=\"og:image:width\" content=\"1792\" \/>\n\t<meta property=\"og:image:height\" content=\"1024\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/webp\" \/>\n<meta name=\"author\" content=\"Eugene van der Watt\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@DailyAIOfficial\" \/>\n<meta name=\"twitter:site\" content=\"@DailyAIOfficial\" \/>\n<meta name=\"twitter:label1\" content=\"Skrevet af\" \/>\n\t<meta name=\"twitter:data1\" content=\"Eugene van der Watt\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimeret l\u00e6setid\" \/>\n\t<meta name=\"twitter:data2\" content=\"4 minutter\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"NewsArticle\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/\"},\"author\":{\"name\":\"Eugene van der Watt\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/#\\\/schema\\\/person\\\/7ce525c6d0c79838b7cc7cde96993cfa\"},\"headline\":\"NATURAL PLAN: Benchmarking LLMs on natural language 
planning\",\"datePublished\":\"2024-06-10T10:39:06+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/\"},\"wordCount\":606,\"publisher\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/#organization\"},\"image\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/dailyai.com\\\/wp-content\\\/uploads\\\/2024\\\/06\\\/Planning.webp\",\"keywords\":[\"DeepMind\",\"LLMS\"],\"articleSection\":[\"Industry\"],\"inLanguage\":\"da-DK\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/\",\"url\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/\",\"name\":\"NATURAL PLAN: Benchmarking LLMs on natural language planning | DailyAI\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/dailyai.com\\\/wp-content\\\/uploads\\\/2024\\\/06\\\/Planning.webp\",\"datePublished\":\"2024-06-10T10:39:06+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/#breadcrumb\"},\"inLanguage\":\"da-DK\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"da-DK\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-pl
anning\\\/#primaryimage\",\"url\":\"https:\\\/\\\/dailyai.com\\\/wp-content\\\/uploads\\\/2024\\\/06\\\/Planning.webp\",\"contentUrl\":\"https:\\\/\\\/dailyai.com\\\/wp-content\\\/uploads\\\/2024\\\/06\\\/Planning.webp\",\"width\":1792,\"height\":1024},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/2024\\\/06\\\/natural-plan-benchmarking-llms-on-natural-language-planning\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/dailyai.com\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"NATURAL PLAN: Benchmarking LLMs on natural language planning\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/#website\",\"url\":\"https:\\\/\\\/dailyai.com\\\/\",\"name\":\"DailyAI\",\"description\":\"Your Daily Dose of AI News\",\"publisher\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/dailyai.com\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"da-DK\"},{\"@type\":\"Organization\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/#organization\",\"name\":\"DailyAI\",\"url\":\"https:\\\/\\\/dailyai.com\\\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"da-DK\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/#\\\/schema\\\/logo\\\/image\\\/\",\"url\":\"https:\\\/\\\/dailyai.com\\\/wp-content\\\/uploads\\\/2023\\\/06\\\/Daily-Ai_TL_colour.png\",\"contentUrl\":\"https:\\\/\\\/dailyai.com\\\/wp-content\\\/uploads\\\/2023\\\/06\\\/Daily-Ai_TL_colour.png\",\"width\":4501,\"height\":934,\"caption\":\"DailyAI\"},\"image\":{\"@id\":\"https:\\\/\\\/dailyai.com\\\/#\\\/schema\\\/logo\\\/image\\\/\"},\"sameAs\":[\"https:\\\/\\\/x.com\\\/DailyAIOfficial\",\"https:\\\/\\\/www.linkedin.com\\\/company\\\/dailyaiofficial\\\/\",\"https:\\\/\\\/www.youtube.c
om\\\/@DailyAIOfficial\"]},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/#\\\/schema\\\/person\\\/7ce525c6d0c79838b7cc7cde96993cfa\",\"name\":\"Eugene van der Watt\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"da-DK\",\"@id\":\"https:\\\/\\\/dailyai.com\\\/wp-content\\\/uploads\\\/2023\\\/07\\\/Eugine_Profile_Picture-96x96.png\",\"url\":\"https:\\\/\\\/dailyai.com\\\/wp-content\\\/uploads\\\/2023\\\/07\\\/Eugine_Profile_Picture-96x96.png\",\"contentUrl\":\"https:\\\/\\\/dailyai.com\\\/wp-content\\\/uploads\\\/2023\\\/07\\\/Eugine_Profile_Picture-96x96.png\",\"caption\":\"Eugene van der Watt\"},\"description\":\"Eugene comes from an electronic engineering background and loves all things tech. When he takes a break from consuming AI news you'll find him at the snooker table.\",\"sameAs\":[\"www.linkedin.com\\\/in\\\/eugene-van-der-watt-16828119\"],\"url\":\"https:\\\/\\\/dailyai.com\\\/da\\\/author\\\/eugene\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"NATURAL PLAN: Benchmarking af LLM'er p\u00e5 naturlig sprogplanl\u00e6gning | DailyAI","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/dailyai.com\/da\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/","og_locale":"da_DK","og_type":"article","og_title":"NATURAL PLAN: Benchmarking LLMs on natural language planning | DailyAI","og_description":"Google DeepMind researchers developed NATURAL PLAN, a benchmark for evaluating the capability of LLMs to plan real-world tasks based on natural language prompts. The next evolution of AI is to have it leave the confines of a chat platform and take on agentic roles to complete tasks across platforms on our behalf. But that\u2019s harder than it sounds. Planning tasks like scheduling a meeting or compiling a holiday itinerary might seem simple for us. 
Humans are good at reasoning through multiple steps and predicting whether a course of action will accomplish the desired objective or not. You might find that","og_url":"https:\/\/dailyai.com\/da\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/","og_site_name":"DailyAI","article_published_time":"2024-06-10T10:39:06+00:00","og_image":[{"width":1792,"height":1024,"url":"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/Planning.webp","type":"image\/webp"}],"author":"Eugene van der Watt","twitter_card":"summary_large_image","twitter_creator":"@DailyAIOfficial","twitter_site":"@DailyAIOfficial","twitter_misc":{"Skrevet af":"Eugene van der Watt","Estimeret l\u00e6setid":"4 minutter"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"NewsArticle","@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/#article","isPartOf":{"@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/"},"author":{"name":"Eugene van der Watt","@id":"https:\/\/dailyai.com\/#\/schema\/person\/7ce525c6d0c79838b7cc7cde96993cfa"},"headline":"NATURAL PLAN: Benchmarking LLMs on natural language planning","datePublished":"2024-06-10T10:39:06+00:00","mainEntityOfPage":{"@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/"},"wordCount":606,"publisher":{"@id":"https:\/\/dailyai.com\/#organization"},"image":{"@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/#primaryimage"},"thumbnailUrl":"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/Planning.webp","keywords":["DeepMind","LLMS"],"articleSection":["Industry"],"inLanguage":"da-DK"},{"@type":"WebPage","@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/","url":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/","name":"NATURAL 
PLAN: Benchmarking af LLM'er p\u00e5 naturlig sprogplanl\u00e6gning | DailyAI","isPartOf":{"@id":"https:\/\/dailyai.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/#primaryimage"},"image":{"@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/#primaryimage"},"thumbnailUrl":"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/Planning.webp","datePublished":"2024-06-10T10:39:06+00:00","breadcrumb":{"@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/#breadcrumb"},"inLanguage":"da-DK","potentialAction":[{"@type":"ReadAction","target":["https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/"]}]},{"@type":"ImageObject","inLanguage":"da-DK","@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/#primaryimage","url":"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/Planning.webp","contentUrl":"https:\/\/dailyai.com\/wp-content\/uploads\/2024\/06\/Planning.webp","width":1792,"height":1024},{"@type":"BreadcrumbList","@id":"https:\/\/dailyai.com\/2024\/06\/natural-plan-benchmarking-llms-on-natural-language-planning\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/dailyai.com\/"},{"@type":"ListItem","position":2,"name":"NATURAL PLAN: Benchmarking LLMs on natural language planning"}]},{"@type":"WebSite","@id":"https:\/\/dailyai.com\/#website","url":"https:\/\/dailyai.com\/","name":"DailyAI","description":"Din daglige dosis af 
AI-nyheder","publisher":{"@id":"https:\/\/dailyai.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/dailyai.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"da-DK"},{"@type":"Organization","@id":"https:\/\/dailyai.com\/#organization","name":"DailyAI","url":"https:\/\/dailyai.com\/","logo":{"@type":"ImageObject","inLanguage":"da-DK","@id":"https:\/\/dailyai.com\/#\/schema\/logo\/image\/","url":"https:\/\/dailyai.com\/wp-content\/uploads\/2023\/06\/Daily-Ai_TL_colour.png","contentUrl":"https:\/\/dailyai.com\/wp-content\/uploads\/2023\/06\/Daily-Ai_TL_colour.png","width":4501,"height":934,"caption":"DailyAI"},"image":{"@id":"https:\/\/dailyai.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/DailyAIOfficial","https:\/\/www.linkedin.com\/company\/dailyaiofficial\/","https:\/\/www.youtube.com\/@DailyAIOfficial"]},{"@type":"Person","@id":"https:\/\/dailyai.com\/#\/schema\/person\/7ce525c6d0c79838b7cc7cde96993cfa","name":"Eugene van der Watt","image":{"@type":"ImageObject","inLanguage":"da-DK","@id":"https:\/\/dailyai.com\/wp-content\/uploads\/2023\/07\/Eugine_Profile_Picture-96x96.png","url":"https:\/\/dailyai.com\/wp-content\/uploads\/2023\/07\/Eugine_Profile_Picture-96x96.png","contentUrl":"https:\/\/dailyai.com\/wp-content\/uploads\/2023\/07\/Eugine_Profile_Picture-96x96.png","caption":"Eugene van der Watt"},"description":"Eugene har en baggrund som elektronikingeni\u00f8r og elsker alt, hvad der har med teknologi at g\u00f8re. 
N\u00e5r han tager en pause fra at l\u00e6se AI-nyheder, kan du finde ham ved snookerbordet.","sameAs":["www.linkedin.com\/in\/eugene-van-der-watt-16828119"],"url":"https:\/\/dailyai.com\/da\/author\/eugene\/"}]}},"_links":{"self":[{"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/posts\/12782","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/comments?post=12782"}],"version-history":[{"count":3,"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/posts\/12782\/revisions"}],"predecessor-version":[{"id":12789,"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/posts\/12782\/revisions\/12789"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/media\/12787"}],"wp:attachment":[{"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/media?parent=12782"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/categories?post=12782"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dailyai.com\/da\/wp-json\/wp\/v2\/tags?post=12782"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}