{"id":746,"date":"2025-01-30T13:05:33","date_gmt":"2025-01-30T13:05:33","guid":{"rendered":"https:\/\/janusai.pro\/?p=746"},"modified":"2025-01-30T13:05:35","modified_gmt":"2025-01-30T13:05:35","slug":"the-complete-explanation-from-deepseek-janus-to-janus-pro","status":"publish","type":"post","link":"https:\/\/janusai.pro\/cs\/the-complete-explanation-from-deepseek-janus-to-janus-pro\/","title":{"rendered":"The Complete Explanation: From DeepSeek Janus to Janus-Pro!"},"content":{"rendered":"<p>Take-home message: Janus is a simple, unified, and extensible model for multimodal understanding and generation that decouples the visual encoding for multimodal understanding from the visual encoding for generation, mitigating potential conflicts between the two tasks. It can be extended to additional input modalities in the future. Janus-Pro builds on this foundation by optimizing the training strategy (including more training steps, adjusted data ratios, etc.), adding more data (including synthetic data, etc.),
and scaling up the model (to 7 billion parameters), which improves the model's multimodal understanding and its instruction following in text-to-image generation.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/rmy9ct2fln.feishu.cn\/space\/api\/box\/stream\/download\/asynccode\/?code=Mjg4MjEwYjVlNzk0YTgyMTc0NDJlODQ4MTU2ZmRjYTVfWnhaaVEyZlEwUHFrUHNUeGNCOWpCRU1EVDN0QktBMUxfVG9rZW46SkVQZmJmSEhqb1g4YTJ4MVNYdmNPT2oybmVmXzE3MzgyNDIwMzc6MTczODI0NTYzN19WNA\" alt=\"\"\/><\/figure>\n\n\n\n<p><a href=\"https:\/\/github.com\/deepseek-ai\/Janus\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Code repository<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/deepseek-ai\/Janus\/blob\/main\/janus_pro_tech_report.pdf\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Janus-Pro technical report<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/huggingface.co\/deepseek-ai\/Janus-Pro-7B\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Janus-Pro<\/a> is an advanced version of the earlier Janus work. Specifically, it incorporates (1) an optimized training strategy, (2) expanded training data, and (3) larger model sizes. With these improvements, Janus-Pro makes significant advances in multimodal understanding and in text-to-image instruction following, while also making text-to-image generation more stable.
Before unpacking Janus-Pro, let's review Janus.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Reviewing_Janus\"><\/span>Reviewing Janus<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Janus, the predecessor, is an autoregressive framework for unified multimodal understanding and generation that decouples the visual encoding used for the two tasks. For multimodal understanding, the typical design follows LLaVA, which uses a visual encoder as a bridge that lets a large language model understand images. Generation is usually based on diffusion models, although some approaches use autoregressive methods. Some approaches try to unify multimodal understanding and generation in a single transformer, which typically uses a single visual encoder to process the inputs of both tasks.<\/p>\n\n\n\n<p>However, the two tasks need different representations.
In multimodal understanding, the visual encoder focuses on extracting high-level semantic information (e.g. object categories or visual attributes), and the output involves not only extracting information from the image but also complex semantic reasoning, so the encoder focuses primarily on high-dimensional semantic representations.
The generation task is mainly concerned with generating local details and maintaining global consistency in the image, so it requires low-dimensional encoded representations of spatial structure and texture detail. Unifying the representations of both tasks in the same space can lead to conflicts.<\/p>\n\n\n\n<p>Janus contains 2 independent visual encoding paths for multimodal understanding and generation, which brings two advantages: 1) it mitigates the conflicts arising from the different granularity requirements of multimodal understanding and generation, and 2) it is flexible and scalable: because the paths are decoupled, understanding and generation can each be encoded with the state-of-the-art encoding techniques of their own domain, and in the future point clouds, EEG signals, or audio data can be fed in and processed by the unified transformer.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/rmy9ct2fln.feishu.cn\/space\/api\/box\/stream\/download\/asynccode\/?code=OTE3ZjkyNWQ5MmUwNDQzM2VjN2VlNWYwZjAxYTVmZGRfMXpJMWVObDBKOHYxTVJqeEw2S0pHT2hGU3RuVHdnWVdfVG9rZW46UDQyQ2Jrb0Myb1h0bjR4TFBrV2NRS29GbkRmXzE3MzgyNDIwMzc6MTczODI0NTYzN19WNA\" alt=\"\"\/><\/figure>\n\n\n\n<p>For text understanding, the text is converted to discrete IDs using the LLM's built-in tokenizer;<\/p>\n\n\n\n<p>For multimodal
understanding, high-dimensional semantic features are extracted from images using the SigLIP encoder (author's note: Cosmos also uses a SigLIP encoder in its Guardrails section), and the extracted features are mapped into the LLM's text feature space by an Adaptor (a two-layer MLP);<\/p>\n\n\n\n<p>The long side is resized to 384 pixels and the short side is padded to 384 pixels with RGB(127, 127, 127);<\/p>\n\n\n\n<p>For visual generation, the image is converted to discrete IDs using a VQ tokenizer, and each ID is mapped into the LLM's text feature space by an Adaptor (a two-layer MLP);<\/p>\n\n\n\n<p>The short side is resized to 384 pixels and the long side is cropped to 384 pixels;<\/p>\n\n\n\n<p>Overall training was carried out on 16 nodes, each with 8 Nvidia A100 GPUs;<\/p>\n\n\n\n<p>For both visual generation and multimodal understanding tasks, the image feature sequence and the text feature sequence are concatenated as input to the LLM (the paper uses DeepSeek-LLM 1.3B);<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>The LLM's built-in prediction head is used for text prediction in both pure-text understanding and multimodal understanding tasks, while a randomly initialized prediction head is used for image prediction in the visual generation task.
The whole model follows an autoregressive framework, with no need for specially designed attention masks.<\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Janus_training_is_divided_into_3_phases\"><\/span><a href=\"https:\/\/huggingface.co\/deepseek-ai\/Janus-Pro-7B\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Janus training<\/a> is divided into 3 phases:<span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Phase_1\"><\/span>Phase 1<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p><strong>Train the Adaptor and Image Head<\/strong> to build connections between linguistic and visual elements in the embedding space, so that the LLM can understand entities in the image and gains an initial visual generation capability;<\/p>\n\n\n\n<p>For multimodal understanding, use 1.25 million image-text caption pairs from ShareGPT4V in the format: ;<\/p>\n\n\n\n<p>For visual generation, use 1.2 million samples from ImageNet-1k in the format: ;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Phase_2\"><\/span>Phase 2<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p><strong>Unified pretraining<\/strong>, which uses a multimodal corpus to learn multimodal understanding and generation. This phase uses plain text data, multimodal understanding data, and visual generation data.
Visual generation is first trained simply on ImageNet-1k, after which general text-to-image data is used to strengthen the model's open-domain visual generation;<\/p>\n\n\n\n<p>Text data: the DeepSeek-LLM pretraining corpus;<\/p>\n\n\n\n<p>Interleaved image-text data: the WikiHow and WIT datasets;<\/p>\n\n\n\n<p>Image caption data: images from multiple sources, some of them re-captioned with open-source multimodal models, with the data formatted as question-answer pairs built around the prompt \u201cDescribe the image in detail.\u201d;<\/p>\n\n\n\n<p>Table and chart data: the corresponding table and chart data from DeepSeek-VL in the format ;<\/p>\n\n\n\n<p>Visual generation data: image-caption pairs from multiple datasets plus 2 million in-house samples;<\/p>\n\n\n\n<p>During training, only the first sentence of the caption is used with a probability of 25%;<\/p>\n\n\n\n<p>ImageNet samples appear only in the first 120K training steps, and images from other datasets appear in the following 60K steps;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Phase_3\"><\/span>Phase 3<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p><strong>Supervised fine-tuning<\/strong>, in which the pretrained model is fine-tuned on instruction-tuning data to improve
its ability to follow instructions and hold dialogues. All parameters are fine-tuned except the generation encoder. System and user prompts are masked so that only the answers are supervised. To make sure Janus is proficient in both multimodal understanding and generation, the model is not fine-tuned separately for specific tasks. Instead, a mix of text-only dialogue data, multimodal understanding data, and visual generation data is used to ensure versatility across scenarios;<\/p>\n\n\n\n<p>Text understanding: uses data from specific sources;<\/p>\n\n\n\n<p>Multimodal understanding: uses data from multiple sources for instruction tuning;<\/p>\n\n\n\n<p>Visual generation: uses a subset of the image-text pairs from some of the Phase 2 datasets plus 4 million in-house samples;<\/p>\n\n\n\n<p>The data format is: User: \\n Assistant: ;<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/rmy9ct2fln.feishu.cn\/space\/api\/box\/stream\/download\/asynccode\/?code=M2I3MWQ5MjQyNTM5NjIyZTkyMjdlODgwMDg5NzIwYzJfSGVTUnVzb0I3bEREQXBkMEJGN0lqT0JBaEVUWEQwS05fVG9rZW46Vm9OMWJzYnNsbzRGR1R4YlJrNWNad1psblhjXzE3MzgyNDIwMzc6MTczODI0NTYzN19WNA\" alt=\"\"\/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Training_Objectives\"><\/span>Training Objectives<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>Janus is an autoregressive model trained with a cross-entropy loss.
For plain-text understanding and multimodal understanding tasks, the loss is computed on the text sequence. For visual generation tasks, the loss is computed only on the image sequence. To keep the design simple, different tasks are not assigned different loss weights.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Reasoning\"><\/span>Inference<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>Inference uses next-token prediction. For plain-text understanding and multimodal understanding, tokens are sampled one after another from the predicted distribution. For image generation, classifier-free guidance is used.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Possible_extensions\"><\/span>Possible extensions<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>For multimodal understanding, one could 1) choose a stronger visual encoder and 2) use dynamic high-resolution techniques;<\/p>\n\n\n\n<p>For visual generation, one could 1) choose finer-grained encoders, 2) use loss functions designed specifically for visual generation, and 3) combine causal attention with parallel methods;<\/p>\n\n\n\n<p>More modalities: 3D point clouds, haptics, EEG, and other input
modalities could be integrated;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Janus-Pro_Upgrade\"><\/span><a href=\"https:\/\/huggingface.co\/deepseek-ai\/Janus-Pro-7B\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Janus-Pro Upgrade<\/a><span class=\"ez-toc-section-end\"><\/span><\/h2>\n\n\n\n<p>Because of its limited training data and relatively small model capacity (1B), Janus falls short in some respects, such as poor image generation from short prompts and inconsistent text-to-image quality. The architecture of Janus-Pro is the same as that of Janus, as the figure below shows:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/rmy9ct2fln.feishu.cn\/space\/api\/box\/stream\/download\/asynccode\/?code=NDY0ZWM0NTJiOTNlYTE4MWI4NmMwNGE4Mjc3NmYyMDJfc1FEMHVOMHo1OUM0ZVhoakJtU1lZQXdZNTd4NVFXRzhfVG9rZW46RjJrTGI3VVlqb0IxS3N4aHVVN2NxUWxJbnZkXzE3MzgyNDIwMzc6MTczODI0NTYzN19WNA\" alt=\"\"\/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Main_Improvements\"><\/span>Main Improvements<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Training_Strategy\"><\/span>Training Strategy<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>Phase 1: Increase the number of training steps and train fully on ImageNet;<\/p>\n\n\n\n<p>Phase 2: ImageNet is no longer used; regular text-to-image data is used directly for training;<\/p>\n\n\n\n<p>Phase 3: Adjust the dataset ratios during fine-tuning, changing the ratio of
multimodal data, plain text data, and text-to-image data from 7:3:10 to 5:1:4;<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Data_Scale\"><\/span>Data Scale<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>Multimodal understanding<\/p>\n\n\n\n<p>Phase 2: Added 90 million samples, including YFCC for image captioning and Docmatix for understanding documents with tables and charts;<\/p>\n\n\n\n<p>Phase 3: Added further DeepSeek-VL2 datasets, for example for MEME understanding;<\/p>\n\n\n\n<p>Visual generation: real-world data can be of low quality, which leads to unstable text-to-image generation and poor aesthetic output. Janus-Pro uses 72 million synthetic aesthetic samples, bringing the ratio of real to synthetic data in the unified pretraining phase (Phase 2) to 1:1;<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Model_Scale\"><\/span>Model Scale<span class=\"ez-toc-section-end\"><\/span><\/h4>\n\n\n\n<p>The model is scaled up to 7 billion parameters;<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Experimental_details\"><\/span>Experimental details<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>Compared with Janus, the experimental details of Janus-Pro are essentially the same.
The difference is that the model with more parameters was trained on more cluster nodes (16 to 32).<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/rmy9ct2fln.feishu.cn\/space\/api\/box\/stream\/download\/asynccode\/?code=NDM1YTM1ZDliNDUwYzAzNzg4MTNiNjUzYWZlZjVhZjhfZGI5ZWloREhYV29OZUxiaEVFc0dhN1dMTDhGdG5ZSnNfVG9rZW46STA0amJtbVlhb0NySk94NkRKNmNqNDVybmdiXzE3MzgyNDIwMzc6MTczODI0NTYzN19WNA\" alt=\"\"\/><\/figure>\n\n\n\n<p>Janus-Pro training hyperparameters<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><span class=\"ez-toc-section\" id=\"Insufficient\"><\/span>Limitations<span class=\"ez-toc-section-end\"><\/span><\/h3>\n\n\n\n<p>For multimodal understanding, the input resolution is limited to 384\u00d7384, which hurts performance on fine-grained visual tasks. For text-to-image generation, the low resolution leaves the generated images lacking in detail.<\/p>","protected":false},"excerpt":{"rendered":"<p>Take-home message: Janus is a simple, unified, and extensible model for multimodal understanding and generation that decouples the visual encoding for multimodal understanding from the visual encoding for generation, mitigating potential conflicts between the two tasks. It can be extended to additional input modalities in the future.
Janus-Pro builds on this foundation by optimizing the training strategy (including increasing...<\/p>","protected":false},"author":2,"featured_media":684,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-746","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/posts\/746","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/comments?post=746"}],"version-history":[{"count":1,"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/posts\/746\/revisions"}],"predecessor-version":[{"id":747,"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/posts\/746\/revisions\/747"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/media\/684"}],"wp:attachment":[{"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/media?parent=746"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/categories?post=746"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/janusai.pro\/cs\/wp-json\/wp\/v2\/tags?post=746"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}