Assessing ChatGPT’s Performance on Software Engineering Test Questions


With the advent of widely accessible AI technologies such as Large Language Models (LLMs), there is some worry about job candidates using the tools to finesse the interview process. We assessed the strength of ChatGPT, a popular LLM, on engineering test questions. Here, we share the results of our experiments, along with recommendations for hiring teams.

Assessment Goals

Our main goals were two-fold:

  1. Determine if ChatGPT can accurately and convincingly answer our candidate examination questions.
  2. Identify which questions are more or less prone to being solved by an AI.

Our investigation aimed to evaluate the effectiveness of our interview process in distinguishing qualified candidates from those using AI tools.

More broadly, we asked ourselves whether this distinction mattered. How trivial was it to answer the test questions using ChatGPT? Perhaps a candidate capable of using such tools effectively in a job application context was equally capable of applying those skills to solve problems on the job. Indeed, many developers experience increased productivity with the use of AI assistants, and it seems that the broad use of LLMs is an emerging trend to embrace rather than to discourage.

Methodology

We evaluated ChatGPT versions 3.5 (Default as per GPT-pro’s designation, not the Legacy version) and 4. A human experimenter, acting as a candidate, used the model to attempt several assessments, including backend, frontend, full stack, and project management scenarios. Questions covered coding, problem-solving, and communication skills. We also tested questions from a popular problem-solving website in the domains of Algorithms, Databases, and Data Structures. The experimenter relied primarily on AI responses, making minimal adjustments and writing no code.

Results

Backend and frontend results demonstrated that ChatGPT generated responses that met requirements and passed test cases. On longer coding questions, some manual intervention was required to ensure solutions aligned with provided sample code and ordering corresponded to the template, and occasionally the AI failed to generate the correct code in time.

ChatGPT excelled at explanatory and multiple-choice questions, as well as “easy” to “medium” programming questions in most domains, i.e., scenarios where answers were readily available through internet search. This is somewhat inevitable, as most useful interview questions are believable scenarios that an employee is likely to encounter on the job. With “harder” problems, solutions are harder to find through search, and without some skill on the part of the user, additional prompting is of little avail.

Difficulty in generating correct responses varied across problem domains. Frontend coding questions proved to be the most challenging without the “smart” use of ChatGPT, and they were also more time-consuming to complete using the tool. This was due to the multiple additional elements involved in frontend problem descriptions as well as tooling, including visual depictions of the desired results.

The feasibility of AI-assisted test completion depended on time limitations, the need for prompt refinement, and experience using ChatGPT (our experimenter’s strategy improved with time across questions). Where problem statements were less straightforward, generating adequate prompts benefitted from English communication skills and an understanding of code. For text-based questions that couldn’t be copy-pasted via the user interface, knowledge of the browser dev tools also helped, although typing out problem statements in full, while time-consuming, was also possible within our given time limitations.

Recommendations for Question Design

Based on our findings, we recommend implementing the following strategies to increase complexity and deter reliance on AI.

Introduce diagrams

Incorporating diagrams that are not explicitly described in the text can add a layer of complexity and reduce reliance on AI-generated answers. Anything preventing the direct copy-pasting of question text can be a partial deterrent. However, it is crucial to consider accessibility requirements when using visual elements.

Use complex directory and dependency structures

Designing coding questions involving intricate directory structures and dependencies makes it harder for AI-reliant candidates to provide accurate solutions, as it is more time-consuming to parse the provided information into prompts consumable by the LLM.

Provide excessive information

Offering more information than necessary and making the format inaccessible to AI tools, such as by requiring the use of lengthy APIs, can further differentiate skilled candidates from those relying on AI assistance.

For non-coding questions, such as long-form or multiple-choice, avoid the test format altogether

Prefer a face-to-face format where interviewers can witness the candidate’s thought process.


Importance of Follow-up Interviews

Face-to-face interviews are crucial to assess candidates’ understanding of the code they submit. Follow-up helps filter out candidates who simply use AI tools and lack a comprehensive understanding of the material. Descriptive problems made trivial by ChatGPT should be presented as face-to-face interview questions instead.

Conclusion

ChatGPT does well with easy-to-medium-level engineering test questions given sufficient time to generate appropriate prompts. However, relying solely on AI-generated answers may not be significantly more efficient than answering honestly with traditional internet research. Follow-up interviews that delve into the candidate’s code understanding can help differentiate candidates.

Since ChatGPT can quickly generate answers given the correct prompt, it can produce code that an inexperienced candidate could not otherwise manage within previously benchmarked time limits. However, as question complexity increases, exclusive reliance on ChatGPT becomes time-consuming. Candidates must invest time in parsing relevant information to construct prompts, waiting for ChatGPT to generate answers (especially with GPT-4, which is slower), and debugging the AI’s responses. Thus, the cost of using ChatGPT as a crutch increases with question complexity, and more clever strategies are required. If a candidate is able to solve senior-level interview questions, with or without the aid of AI, we want to talk to them!

While we have outlined some strategies to help prevent the selection of underqualified candidates, we recognize the value of LLM skills and are exploring ways to integrate tools like GPT into our workflows to enhance our teams’ capabilities and deliver better products for our clients.

By Mona Ghassemi

Software Developer/DevOps, AlleyCorpNord