- For more benchmark detail, please check HERE.

From 5fe541d0323c883ee0df8bff0bb9a1699ecc5439 Mon Sep 17 00:00:00 2001
From: nmd2k
CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs
- Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses modelsβ ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.
+ Recent advancements in Code Large Language Models (CodeLLMs) have predominantly focused on open-ended code generation tasks, often neglecting the critical aspect of code understanding and comprehension. To bridge this gap, we present CodeMMLU, a comprehensive multiple-choice question-answer benchmark designed to evaluate the depth of software and code understanding in LLMs. CodeMMLU includes over 10,000 questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across multiple programming languages. Unlike traditional benchmarks, CodeMMLU assesses models' ability to reason about code rather than merely generate it, providing deeper insights into their grasp of complex software concepts and systems. Our extensive evaluation reveals that even state-of-the-art models face significant challenges with CodeMMLU, highlighting deficiencies in comprehension beyond code generation. By underscoring the crucial relationship between code understanding and effective generation, CodeMMLU serves as a vital resource for advancing AI-assisted software development, ultimately aiming to create more reliable and capable coding assistants.
CodeMMLU Leaderboard
CodeMMLU: A Multi-Task Benchmark for As
CodeMMLU: A Multi-Task Benchmark for As
-
🤗 Dataset
@@ -192,9 +299,15 @@
CodeMMLU: A Multi-Task Benchmark for As
+ Leaderboard
Abstract
Overview
@@ -282,251 +395,396 @@
CodeMMLU revealed significant performance differences across models, as shown in the table below. OpenAI's GPT-4o outperformed all other models on CodeMMLU with an overall score of 67.0. Notably, despite not being the latest model, the instruction-tuned version of Meta-Llama-3-70B achieved the highest score (62.45) among open-source models from 8 families. While LLMs perform well on knowledge-based tasks, they struggle with real-world problems, particularly defect detection.
| Family | Model name | Size (B) | Syntactic knowledge | Semantic knowledge | Real-world tasks | CodeMMLU |
|---|---|---|---|---|---|---|
| Closed-source models | | | | | | |
| Anthropic | Claude-3-sonnet@20240229 | - | 67.22 | 66.08 | 38.26 | 53.97 |
| OpenAI | GPT-4o-2024-05-13 | - | 60.41 | 57.82 | 77.18 | 67.0 |
| | GPT-3.5-turbo-0613 | - | 61.68 | 53.64 | 45.26 | 51.7 |
| Open-source models | | | | | | |
| Meta Llama | CodeLlama-34b-Instruct-hf | 34 | 56.81 | 46.93 | 23.55 | 38.73 |
| | Meta-Llama-3-70B | 70 | 63.38 | 57.64 | 35.29 | 48.98 |
| | Meta-Llama-3-70B-Instruct | 70 | 64.90 | 62.96 | 60.84 | 62.45 |
| | Meta-Llama-3.1-70B | 70 | 64.09 | 59.00 | 8.22 | 37.56 |
| | Meta-Llama-3.1-70B-Instruct | 70 | 64.42 | 62.25 | 56.11 | 60 |
| Mistral | Mistral-7B-Instruct-v0.3 | 7 | 54.42 | 51.25 | 31.85 | 43.33 |
| | Mixtral-8x7B-Instruct-v0.1 | 46.7 | 61.17 | 54.89 | 24.90 | 42.96 |
| | Codestral-22B-v0.1 | 22 | 60.34 | 52.11 | 37.86 | 47.6 |
| Phi | Phi-3-medium-128k-instruct | 14 | 58.54 | 54.56 | 37.89 | 48.03 |
| | Phi-3-mini-128k-instruct | 3.8 | 53.01 | 48.65 | 22.36 | 37.93 |
| Qwen | Qwen2-57B-A14B-Instruct | 57 | 61.34 | 57.48 | 30.48 | 46.34 |
| | CodeQwen1.5-7B-Chat | 7 | 49.66 | 46.58 | 56.37 | 49.82 |
| Yi | Yi-1.5-34B-Chat | 34 | 58.32 | 55.59 | 40.27 | 49.39 |
| | Yi-1.5-9B-Chat | 9 | 55.64 | 55.06 | 37.15 | 47.23 |
| Deep Seek | DeepSeek-coder-7b-instruct-v1.5 | 7 | 56.67 | 47.90 | 28.46 | 41.21 |
| | DeepSeek-coder-33b-instruct | 33 | 53.65 | 46.11 | 21.47 | 36.6 |
| | DeepSeek-moe-16b-chat | 16.4 | 31.74 | 35.43 | 27.33 | 31.01 |
| | DeepSeek-Coder-V2-Lite-Instruct | 16 | 59.91 | 54.76 | 33.62 | 46.51 |
| InternLM | InternLM2-5-20b-chat | 20 | 57.85 | 55.51 | 30.44 | 44.89 |
| StarCoder2 | StarCoder2-15b-instruct-v0.1 | 15 | 56.58 | 49.07 | 42.79 | 47.94 |
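The ordering discussed above can be checked with a few lines of Python. This is only a sketch: `rank_models` is a hypothetical helper, not part of the CodeMMLU tooling, and the list holds just a handful of (model, CodeMMLU score) pairs copied from the table rather than the full set.

```python
# Illustrative subset: (model name, overall CodeMMLU score) pairs copied from the table.
scores = [
    ("Claude-3-sonnet@20240229", 53.97),
    ("GPT-4o-2024-05-13", 67.0),
    ("GPT-3.5-turbo-0613", 51.7),
    ("Meta-Llama-3-70B-Instruct", 62.45),
    ("Meta-Llama-3.1-70B-Instruct", 60.0),
    ("CodeQwen1.5-7B-Chat", 49.82),
]


def rank_models(rows):
    """Hypothetical helper: sort (name, score) pairs by score, highest first."""
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = rank_models(scores)
for position, (name, score) in enumerate(ranked, start=1):
    print(f"{position}. {name}: {score:.2f}")
```

With this subset, GPT-4o ranks first and Meta-Llama-3-70B-Instruct second, matching the summary in the paragraph before the table.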
- For more benchmark detail, please check HERE.
- Two approaches are investigated to enhance the performance of generated code in terms of both functional correctness and dependency invocation. -
[binary patch data omitted]
diff --git a/leaderboards/codemmlu/index.html b/leaderboards/codemmlu/index.html
index 2fe6ac3..38b8eae 100644
--- a/leaderboards/codemmlu/index.html
+++ b/leaderboards/codemmlu/index.html
@@ -1,747 +1,21 @@
- CodeMMLU Leaderboard
- CodeMMLU Leaderboard
- A Multi-Task Benchmark for Assessing Code Understanding Capabilities of CodeLLMs
- Notes
- Evaluated using CodeMMLU
- Models are ranked according to Accuracy using greedy decoding.
- "Size" here is the amount of activated model weight during inference.
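Since the benchmark is multiple-choice, the accuracy used for ranking comes down to the share of questions where the selected option matches the answer key. The following is a minimal sketch under that assumption: the `accuracy` function, the single-letter answer format, and the sample data are all hypothetical and are not taken from the CodeMMLU evaluation code. Decoding itself is not modelled here.

```python
def accuracy(predictions, answers):
    """Percentage of questions whose predicted option equals the reference option."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Made-up example: four questions, three answered correctly.
predicted = ["A", "c", "B", "D"]
reference = ["A", "C", "D", "D"]
print(accuracy(predicted, reference))
```

Normalising case and whitespace before comparing is a design choice for this sketch only; a real harness may extract the chosen option differently.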
- More Leaderboards
- In addition to CodeMMLU leaderboards, it is recommended to comprehensively understand LLM coding ability through a diverse set of benchmarks and leaderboards, such as:
- RepoExec Leaderboard
- Bigcode-Bench Leaderboard
- EvalPlus Leaderboard
- Big Code Models Leaderboard
- Chatbot Arena Leaderboard
- CrossCodeEval
- ClassEval
- CRUXEval
- Code Lingua
- Evo-Eval
- HumanEval.jl - Julia version HumanEval with EvalPlus test cases
- InfiCoder-Eval
- LiveCodeBench
- NaturalCodeBench
- RepoBench
- SWE-bench
- Acknowledgements
- We thank the EvalPlus and BigCode teams for providing the leaderboard template.
\ No newline at end of file
+ CodeMMLU Leaderboard
+ Leaderboard
\ No newline at end of file