Using Google Cloud Vertex AI Code Chat to automate programming test scoring | by Cyrus Wong | Google Developer Experts | Dec, 2023


Traditionally, educators create unit tests to automatically score students' programming tasks. However, the precondition for running unit tests is that the project code must be runnable or compile without errors. Therefore, if students cannot keep the project fully runnable, they will only receive a zero mark. This is undesirable, especially in a practical programming test situation. Even when students submit partially correct code statements, they should earn some score. As a result, educators need to review all source code one by one. This task is exhausting, time-consuming, and hard to grade in a fair and consistent manner.

For a programming test, we provide starter code to students. They are required to read the instructions and write additional code to meet the requirements. We already have a standard answer. We store the question title, instructions, starter code, answer, and mark in an Excel sheet. This sheet is used to prompt and score student answers.
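The standard-answer sheet can be loaded into a lookup dictionary keyed by question title, matching the `standard_answer_dict` used later. A minimal sketch (the column names are assumptions for illustration; a real run would call `pd.read_excel` on the actual sheet):

```python
import pandas as pd

# In practice the sheet would be loaded from Excel, e.g.
#   standard_answer_df = pd.read_excel("standard_answers.xlsx")
# Here a tiny in-memory frame stands in, with assumed column names.
standard_answer_df = pd.DataFrame([{
    "Question": "Q1",
    "Instruction": "Read two integers and print their sum.",
    "Starter": "def main():\n    pass",
    "Answer": "def main():\n    a = int(input())\n    b = int(input())\n    print(a + b)",
    "Mark": 3,
}])

# Index the standard answers by question title for quick lookup when scoring.
standard_answer_dict = {
    row["Question"]: row.drop("Question").to_dict()
    for _, row in standard_answer_df.iterrows()
}
```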

1. Crafting the Chat Prompt:

  • Design a comprehensive chat prompt that incorporates essential elements such as "instruction", "starter", "answer", "mark", "student_answer", and "student_commit".
  • Utilize "Run on Save" functionality to encourage students to commit their code regularly upon saving. This serves as a reliable indicator of their active engagement and honest effort.

2. Setting Up the LLM:

  • Create an LLM within Vertex AI, specifically the codechat-bison model.
  • Configure the temperature setting to a low value, since scoring does not require creative responses.

3. Utilizing PydanticOutputParser:

  • Employ PydanticOutputParser to generate the desired output format instructions and extract the response into a Python object.

4. Connecting the Components:

  • Seamlessly integrate all of the aforementioned components into a smoothly functioning chain. This ensures efficient prompt management and effective LLM use.


from langchain.chat_models import ChatVertexAI
from langchain.prompts.chat import ChatPromptTemplate
import langchain
langchain.debug = False
from langchain.output_parsers import PydanticOutputParser
from langchain.pydantic_v1 import BaseModel, Field, validator
from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

# Define your desired data structure.
class ScoreResult(BaseModel):
    score: int = Field(description="Score")
    comments: str = Field(description="Comments")
    calculation: str = Field(description="Calculation")

parser = PydanticOutputParser(pydantic_object=ScoreResult)

def score_answer(instruction, starter, answer, mark, student_answer, student_commit,
                 temperature=0.1, prompt_file="grader_prompt.txt"):
    # The grader prompt template is kept in an external text file.
    with open(prompt_file) as f:
        grader_prompt = f.read()

    # System message: grader persona plus the parser's format instructions.
    prompt = PromptTemplate(
        template="You are a Python programming teacher who grades student Python exercises.\n{format_instructions}\n",
        input_variables=[],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )
    system_message_prompt = SystemMessagePromptTemplate(prompt=prompt)
    human_message_prompt = HumanMessagePromptTemplate(prompt=PromptTemplate(
        template=grader_prompt,
        input_variables=["instruction", "starter", "answer", "mark",
                         "student_answer", "student_commit"],
    ))
    chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt, human_message_prompt])

    # Low temperature: scoring does not need creative responses.
    llm = ChatVertexAI(model_name="codechat-bison", temperature=temperature)
    runnable = chat_prompt | llm | parser

    # Get the result
    data = {"instruction": instruction,
            "starter": starter,
            "answer": answer,
            "mark": mark,
            "student_answer": student_answer,
            "student_commit": student_commit}
    output = runnable.invoke(data)
    return output

The output of parser.get_format_instructions() in the system prompt:

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
{"properties": {"score": {"title": "Score", "description": "Score", "type": "integer"}, "comments": {"title": "Comments", "description": "Comments", "type": "string"}, "calculation": {"title": "Calculation", "description": "Calculation", "type": "string"}}, "required": ["score", "comments", "calculation"]}
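Conceptually, the parser validates the model's raw JSON reply against these fields and turns it into a typed Python object. A standard-library-only sketch of that idea (the reply string is a made-up example, not real model output):

```python
import json
from dataclasses import dataclass

# Stdlib analog of what PydanticOutputParser does: deserialize the model's
# JSON reply into a typed object with score, comments, and calculation.
@dataclass
class ScoreResult:
    score: int
    comments: str
    calculation: str

raw_reply = '{"score": 2, "comments": "Good attempt.", "calculation": "1 + 1 = 2"}'
result = ScoreResult(**json.loads(raw_reply))
print(result.score)  # 2
```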


Programming question

Number of times code committed to GitHub: {student_commit}

Student adds the code statements from the Starter.
Student follows the question to add more code statements.

- If the content of StudentAnswer is almost the same as the content of Starter, the score is 0 with the comment "Not attempted". Skip all other rules.
- The maximum score of this question is {mark}.
- Compare the StudentAnswer and StandardAnswer line by line and by programming logic. Give 1 score for each line of correct code.
- Don't give score to code statements provided by the Starter.
- Evaluate both StandardAnswer and StudentAnswer for input, print, and main function line by line.
- Explain your score calculation.
- If you are unsure, don't give a score!
- Give comments to the student.

The output must be in the following JSON format:
"score" : "...",
"comments" : "...",
"calculation" : "..."
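For intuition, the rubric's line-counting rule can be approximated deterministically. A rough sketch (a hypothetical helper, not the article's code, and no substitute for the LLM's logic-aware comparison):

```python
def rubric_score(starter: str, standard_answer: str, student_answer: str, mark: int):
    """Hypothetical line-counting analog of the grading rubric above."""
    starter_lines = {line.strip() for line in starter.splitlines() if line.strip()}
    student_lines = {line.strip() for line in student_answer.splitlines() if line.strip()}
    # "Not attempted": the student answer adds nothing beyond the starter.
    if student_lines <= starter_lines:
        return 0, "Not attempted"
    # One point per correct line that is not already given by the starter,
    # capped at the question's maximum mark.
    score = sum(
        1
        for line in standard_answer.splitlines()
        if line.strip()
        and line.strip() not in starter_lines
        and line.strip() in student_lines
    )
    return min(score, mark), "Scored line by line"

score, comment = rubric_score(
    starter="def main():",
    standard_answer="def main():\n    x = int(input())\n    print(x * 2)",
    student_answer="def main():\n    x = int(input())\n    print(x * 3)",
    mark=2,
)
# One line matches beyond the starter, so the student earns partial credit.
```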

In situations where a failure occurs, manual intervention is necessary to rectify the issue. This may involve switching to a more robust model, fine-tuning the parameters, or making slight adjustments to the prompt. To initiate the troubleshooting process, it is essential to create a backup of the batch job output. This serves as a crucial reference point for analysis and problem-solving.

backup_student_answer_df = student_answer_df.copy()

Manually execute the failed cases by adjusting the following code and running them again.

print(f"Total failed cases: {len(failed_cases)}")

original_model_name = model_name
# You may change to a more powerful model
# model_name = "codechat-bison@002"

if len(failed_cases) > 0:
    print("Failed cases:")
    # Iterate over a copy so that removing resolved cases does not skip items.
    for failed_case in list(failed_cases):
        # Get the row from student_answer_df by Directory
        row = student_answer_df.loc[student_answer_df['Directory'] == failed_case["directory"]]
        question = failed_case['question']
        instruction = standard_answer_dict[question]["Instruction"]
        starter = standard_answer_dict[question]["Starter"]
        answer = standard_answer_dict[question]["Answer"]
        mark = standard_answer_dict[question]["Mark"]
        student_answer = row[question + " Content"]
        student_commit = row[question + " Commit"]
        result = score_answer(instruction, starter, answer, mark, student_answer,
                              student_commit, temperature=0.3)
        # Update student_answer_df with the result
        row[question + " Score"] = result.score
        row[question + " Comments"] = result.comments
        row[question + " Calculation"] = result.calculation
        # Replace the row in student_answer_df
        # student_answer_df.loc[student_answer_df['Directory'] == failed_case["directory"]] = row
        # Remove the resolved case from failed_cases
        failed_cases.remove(failed_case)

model_name = original_model_name

Based on experience, most of the failed cases can be resolved by changing some parameters.
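One way to organize that parameter-changing by hand is an escalation ladder that retries with progressively different settings. A sketch under stated assumptions (the ladder values are illustrative, and `grade_fn` stands in for a `score_answer`-style callable):

```python
# Hypothetical escalation ladder: first raise the temperature, then switch
# to a more powerful model, before giving up on the case.
ATTEMPTS = [
    {"model_name": "codechat-bison", "temperature": 0.1},
    {"model_name": "codechat-bison", "temperature": 0.3},
    {"model_name": "codechat-bison@002", "temperature": 0.3},
]

def grade_with_retries(grade_fn, **kwargs):
    last_error = None
    for settings in ATTEMPTS:
        try:
            # grade_fn is any callable that grades one case and may raise,
            # e.g. when the model's output fails to parse.
            return grade_fn(**kwargs, **settings)
        except Exception as exc:
            last_error = exc
    raise last_error
```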

The output of human_review.ipynb

In this approach, we leverage the power of a pretrained Large Language Model (LLM), specifically the code chat model from Vertex AI, to score students' programming assignments. Unlike traditional unit testing methods, this approach allows partial credit to be awarded even when the submitted code is not fully runnable.

Key to this process is the crafting of a well-structured chat prompt that incorporates essential information such as instructions, starter code, answer, mark, student answer, and student commit status. This prompt guides the LLM in evaluating the student's code.

The LLM is configured with a low temperature setting to ensure precise and consistent scoring. A PydanticOutputParser is employed to generate the desired output format instructions and extract the response into a Python object.

By seamlessly integrating these components, we establish a smooth workflow that efficiently manages prompts and utilizes the LLM's capabilities. This enables accurate and reliable scoring of programming assignments, even for partially correct code submissions.

This approach addresses the challenges faced by educators in grading programming assignments, reduces manual effort, and promotes fair and consistent assessment.

Project collaborators include Markus, Kwok Hau Ling, Lau Hing Pui, and Xu Yuan from the IT114115 Higher Diploma in Cloud and Data Centre Administration and Microsoft Learn Student Ambassadors candidates.

Cyrus Wong is a senior lecturer at the Hong Kong Institute of Information Technology, where he focuses on teaching public cloud technologies. He is a passionate advocate for cloud tech adoption in media and events: AWS Machine Learning Hero, Microsoft MVP (Azure), and Google Developer Expert (Google Cloud Platform).

