Evaluation of GPT and Open-Source Models on Code Mutation Tasks

Authors:
(1) Bo Wang, Beijing Jiaotong University, Beijing, China ([email protected]);
(2) Mingda Chen, Beijing Jiaotong University, Beijing, China ([email protected]);
(3) Youfang Lin, Beijing Jiaotong University, Beijing, China ([email protected]);
(4) Mike Papadakis, University of Luxembourg, Luxembourg ([email protected]);
(5) Jie M. Zhang, King's College London, London, United Kingdom ([email protected]).
Table of Links
Abstract and 1 Introduction
2 Background and Related Work
3 Study Design
3.1 Overview and Research Questions
3.2 Datasets
3.3 Mutation Generation via LLMs
3.4 Evaluation Metrics
3.5 Experiment Settings
4 Evaluation Results
4.1 RQ1: Cost and Usability Performance
4.2 RQ2: Behavior Similarity
4.3 RQ3: Impacts of Different Prompts
4.4 RQ4: Impacts of Different LLMs
4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations
5 Discussion
5.1 Sensitivity to the Chosen Experiment Settings
5.2 Implications
5.3 Threats to Validity
6 Conclusion and References
4.4 RQ4: Impacts of Different LLMs
To answer this RQ, we add two additional LLMs, GPT-4 and StarChat-β-16B, and compare their results with those of the two default LLMs, GPT-3.5 and CodeLlama-13B. The right half of Table 7 shows the comparative results of the models using the default prompt. We observe that the closed-source LLMs generally outperform the others on most metrics. GPT-3.5 excels in the number of mutations, the generation cost per 1K mutations, and the average generation time, making it ideal for quickly generating many mutations. GPT-4 leads in all usability metrics and behavior similarity metrics, demonstrating its effectiveness in code-related tasks, although its improvements over GPT-3.5 on the behavior metrics are marginal. Between the two open-source LLMs, despite StarChat-β-16B having more parameters, CodeLlama-13B outperforms it on all metrics. This suggests that model architecture and the quality of the training data have a significant impact on performance beyond the number of parameters.
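To make the cost metrics concrete, the following minimal sketch shows one way to compute the generation cost per 1K mutations and the average generation time from per-model logs. The log structure and all numbers here are hypothetical placeholders, not the paper's data or pipeline.

```python
from statistics import mean

# Hypothetical per-mutation generation logs: (API cost in USD, seconds taken).
# All values are illustrative placeholders, not measurements from the paper.
logs = {
    "GPT-3.5": [(0.0021, 1.2), (0.0019, 1.1), (0.0023, 1.4)],
    "GPT-4":   [(0.0310, 3.8), (0.0280, 4.1), (0.0330, 3.9)],
}

for model, records in logs.items():
    costs, times = zip(*records)
    cost_per_1k = 1000 * sum(costs) / len(records)  # cost per 1K mutations
    avg_time = mean(times)                          # average generation time
    print(f"{model}: ${cost_per_1k:.2f} per 1K mutations, {avg_time:.2f}s each")
```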
4.5 RQ5: Root Causes and Error Types of Non-Compilable Mutations
Non-compilable mutations require a compilation step to filter out, which results in wasted computational resources. As mentioned in Section 4.1, the LLMs generate a significant number of non-compilable mutations. This RQ analyzes the error types and potential root causes of these non-compilable mutations. Following the settings of the previous steps, we first sample 384 mutations from the non-compilable outputs of GPT-3.5, ensuring a 95% confidence level and a 5% margin of error. Based on a manual analysis of these non-compilable mutations, we identified 9 distinct error types, as shown in Table 8.
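For reference, the sample size of 384 follows from the standard formula for estimating a proportion at a 95% confidence level with a 5% margin of error. A minimal worked sketch (our reconstruction, not the authors' script):

```python
# Sample size for estimating a proportion (Cochran's formula):
#   n = z^2 * p * (1 - p) / e^2
# z = 1.96 for 95% confidence, e = 0.05 margin of error,
# p = 0.5 is the most conservative choice (it maximizes p * (1 - p)).
z, e, p = 1.96, 0.05, 0.5
n = z**2 * p * (1 - p) / e**2
print(n)  # 384.16 -> the commonly used sample size of 384
```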
As shown in Table 8, the most common error type, usage of unknown methods, accounts for 27.34% of the total errors, revealing the hallucination issue of generative models [30]. Code structural destruction is the second most common error, accounting for 22.92%, indicating that guaranteeing the syntactic correctness of the generated code remains a challenge for current LLMs. This result suggests that there is still substantial room for improving current LLMs.
To analyze which types of code are prone to causing LLMs to generate non-compilable mutations, we examined the code locations of all the non-compilable mutations generated by GPT-3.5, CodeLlama, LEAM, and μBERT in Section 4.1, as shown in Figure 3. In particular, more than 30% of the mutations occur at locations involving a member reference. This is potentially caused by the inherent complexity of these operations, which often involve multiple dependencies and references. If a required method or member is missing or incorrectly specified, it can easily lead to non-compilable mutations. These errors highlight the need for more context-aware generation, ensuring that method calls and member references align with the structure of the intended program. In addition, we inspect the deletion mutations rejected by the compiler and find that for GPT-3.5, CodeLlama, LEAM, μBERT, and Major, these changes account for 7.1%, 0.2%, 45.3%, 0.14%, and 14.4% of all their non-compilable mutations, respectively. Thus, for LLMs, deletion is not the main cause of non-compilability.
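As an illustration of the compilation filtering step mentioned above, the sketch below screens mutated Java files by attempting to compile each one with `javac`. The paths and the helper name are hypothetical, and the paper's actual pipeline may differ.

```python
import subprocess
from pathlib import Path

OUT_DIR = Path("/tmp/mutant_classes")  # hypothetical output directory

def compiles(java_file: Path, classpath: str = ".") -> bool:
    """Return True if the mutated Java source compiles, i.e. the mutant
    passes the non-compilable filter. Assumes javac is on the PATH."""
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        ["javac", "-cp", classpath, "-d", str(OUT_DIR), str(java_file)],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

# Hypothetical usage: keep only compilable mutants for downstream analysis.
compilable = [m for m in Path("mutants").glob("*.java") if compiles(m)]
```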