How to Fine-Tune Llama2 for Python Coding on Consumer Hardware

Enhancing Llama2’s proficiency in Python through supervised fine-tuning and low-rank adaptation techniques
This article was authored by Luis Roque.

Introduction

Our previous article covered Llama 2 in detail, presenting the family of Large Language Models (LLMs) that Meta recently introduced and made available to the community for research and commercial use. There are already variants designed for specific tasks; for example, Llama2-Chat for chat applications. Still, we might want an LLM even more tailored to our application.

Following this line of thought, the technique we are referring to is transfer learning. This approach leverages the vast knowledge already present in models like Llama2 and transfers that understanding to a new domain. Fine-tuning is a subset or specific form of transfer learning. In fine-tuning, the weights of the entire model, including the pre-trained layers, are typically allowed to adjust to the new data. This means that the knowledge gained during pre-training is refined based on the specifics of the new task.

In this article, we outline a systematic approach to enhance Llama2’s proficiency in Python coding tasks by fine-tuning it on a custom dataset. First, we curate and align a dataset with Llama2’s prompt structure to meet our objectives. We then use Supervised Fine-Tuning (SFT) and Quantized Low-Rank Adaptation (QLoRA) to optimize the Llama2 base model. After optimization, we combine our model’s weights with the foundational Llama2. Finally, we showcase how to perform inference using the fine-tuned model and how it compares against the baseline model.

One important caveat to recognize is that fine-tuning is sometimes unnecessary. Other approaches are easier to implement and, in some cases, better suited for our use case. For example, semantic search with vector databases efficiently handles informational queries, leveraging existing knowledge without custom training. Fine-tuning is required when we need tailored interactions, like specialized Q&A or context-aware responses that use custom data.

Supervised Fine-Tuning

Modern machine learning paradigms commonly leverage pre-trained models. These pre-trained models have already undergone training on large datasets. The goal with SFT is to adapt them to specific tasks using minimal training data.

The way SFT works is by adjusting an LLM, such as Llama2, based on labeled examples that specify the data the model should generate. The dataset for SFT consists of prompts and their associated responses. Developers can either manually create this dataset or generate it using other LLMs. In fact, the open-source community frequently adopts this practice. A review of the top LLMs on the Open LLM Leaderboard [1] shows that almost all of them undergo some form of fine-tuning with an Orca-styled dataset. An Orca-style dataset contains numerous entries, each with a question and a corresponding response from GPT-4 or GPT-3.5. In essence, SFT sharpens the knowledge within Llama2 using a specific set of examples.
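For illustration, an entry in such a dataset might look like the following (a hypothetical example of our own; the field names are assumptions, not taken from an actual Orca-style release):

example_entry = {
    "system_prompt": "You are an AI assistant. Explain your reasoning step by step.",
    "question": "How do I reverse a string in Python?",
    "response": "You can use slicing: 'hello'[::-1] evaluates to 'olleh'. "
    "A step of -1 walks the string backwards, producing a reversed copy.",
}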

Researchers now explicitly fine-tune many LLMs for instruction-following capabilities. This fine-tuning helps the models understand and act on user instructions better. For example, a fine-tuned model can produce a concise summary when a user instructs it to create a summary. A non-fine-tuned model might struggle with the task and become more verbose. As LLMs evolve, this kind of fine-tuning can produce more specialized models that fit the intended use case.

LoRA and QLoRA: An Efficient Approach to Fine-tuning Large Models

To understand LoRA’s operation, we must first know the meaning of a matrix’s rank. The rank of a matrix shows the number of its independent rows or columns. For instance, an NxN matrix filled with random numbers has a rank of N. Nevertheless, if every column of this matrix is just a multiple of the first column, the rank becomes 1. Thus, we can represent a rank 1 matrix as the product of two matrices: an Nx1 matrix times a 1xN matrix, creating an NxN matrix with a rank of 1. In the same way, we can express a rank ‘r’ matrix as the product of an (Nxr) and an (rxN) matrix.
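To make this concrete, here is a minimal NumPy sketch (our own illustration, not part of the fine-tuning code) that checks the rank of a random matrix and of a product of an (N x r) matrix with an (r x N) matrix:

import numpy as np

N, r = 8, 2

# A random NxN matrix is full rank with probability 1
full = np.random.randn(N, N)
print(np.linalg.matrix_rank(full))  # 8

# The product of an (N x r) matrix and an (r x N) matrix has rank at most r
low_rank = np.random.randn(N, r) @ np.random.randn(r, N)
print(np.linalg.matrix_rank(low_rank))  # 2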

LoRA uses the concept of matrix rank to fine-tune using less memory. Instead of adjusting all weights of an LLM, LoRA fine-tunes low-rank matrices and adds them to the existing weights [2]. The existing weights (or the large matrices) stay the same, while training adjusts only the low-rank matrices.

Why is this efficient? A low-rank matrix has significantly fewer parameters. Instead of managing N² parameters, with LoRA, one only needs to handle 2rN parameters. Intuitively, fine-tuning is like making slight adjustments to the original matrix. LoRA determines these adjustments in a computationally cheaper way, trading off some accuracy for efficiency.

Training with LoRA still requires the entire pre-trained model for the forward pass, accompanied by additional LoRA computations. Nevertheless, during the backward propagation, calculations are focused mainly on the gradients of the LoRA section. This approach results in computational savings, especially in GPU memory requirements. That is why it is currently one of the most popular methods for adapting models to new tasks without the extensive computational overhead of traditional fine-tuning.
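The following is a minimal sketch of the idea in PyTorch (our own illustration; the actual fine-tuning below relies on the peft library rather than this class): a frozen base linear layer plus two small trainable matrices whose product, scaled by alpha/r, is added to the output.

import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad = False  # the pre-trained weight stays frozen
        self.lora_A = nn.Linear(in_features, r, bias=False)  # r x N trainable parameters
        self.lora_B = nn.Linear(r, out_features, bias=False)  # N x r trainable parameters
        nn.init.zeros_(self.lora_B.weight)  # start as a no-op so training begins from the base model
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))


layer = LoRALinear(4096, 4096, r=64)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"{trainable:,} trainable of {total:,} total")  # 524,288 (2rN) of 17,301,504

The peft library implements this same pattern for us (including the alpha/r scaling and the zero initialization of one of the two matrices), which is what we use in the fine-tuning section below.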

To make it even more memory efficient, we can use QLoRA [3]. It builds on top of LoRA and enables the usage of these adapters with quantized pre-trained models. In practical terms, this method allows for fine-tuning a 65B parameter model on a 48GB GPU while retaining the performance of a full 16-bit fine-tuning task. QLoRA also introduced other features to enhance memory efficiency. The 4-bit NormalFloat (NF4) data type offers a more compact representation for normally distributed weights. It also employed Double Quantization to further minimize memory usage by quantizing the quantization constants themselves (think of turtles all the way down but with quantization). The last feature was the Paged Optimizers to manage memory spikes.

Curating a Dataset Suited for Python Programming Tasks

We start by defining a Config class that serves as a centralized repository for configuration settings and metadata related to our fine-tuning process. It stores various constants, such as the model and dataset names, output directories, and several parameters, which we will discuss in the upcoming sections. Let’s see the relevant variables for our data pre-processing.

class Config:
    MODEL_NAME = "meta-llama/Llama-2-7b-hf"
    OUTPUT_DIR = "./results"
    NEW_DATASET_NAME_COMPLETE = "luisroque/instruct-python-llama2-500k"
    NEW_DATASET_NAME = "luisroque/instruct-python-llama2-20k"
    NEW_DATASET_NAME_LOCAL = "instruct-python-llama2-500k.pkl"
    NEW_MODEL_PATH = "./Llama-2-7b-minipython-instruct"
    NEW_MODEL_PATH_MERGE = "./Llama-2-7b-minipython-instruct-merge"
    NEW_MODEL_NAME = "Llama-2-7b-minipython-instruct"
    HF_HUB_MODEL_NAME = "luisroque/Llama-2-7b-minipython-instruct"
    SYSTEM_MESSAGE = "Given a puzzle-like code question, provide a well-reasoned, step-by-step Python solution."

The dataset selection is quite important when tailoring a model like Llama2 for Python-centric tasks. We are using the Python Questions from Stack Overflow dataset (CC-BY-SA 3.0), which comprises a vast selection of coding interactions tagged with Python. Since we want to fine-tune our model for Python coding, we refined this dataset, focusing specifically on Python-related exchanges.

import pandas as pd
from bs4 import BeautifulSoup


def contains_code(text):
    python_keywords = [
        "def",
        "class",
        "import",
        "print",
        "return",
        "for",
        "while",
        "if",
        "else",
        "elif",
        "try",
        "except",
        "lambda",
        "list",
        "dict",
        "set",
        "str",
        "=",
        "{",
        "}",
        "(",
        ")",
    ]

    for keyword in python_keywords:
        if keyword in text:
            return True
    return False


def load_data_to_fine_tune():
    """Load the dataset and filter for Python language."""
    dtypes_questions = {"Id": "int32", "Score": "int16", "Title": "str", "Body": "str"}
    df_questions = pd.read_csv(
        "Questions.csv",
        usecols=["Id", "Score", "Title", "Body"],
        encoding="ISO-8859-1",
        dtype=dtypes_questions,
    )

    dtypes_answers = {
        "Id": "int32",
        "ParentId": "int32",
        "Score": "int16",
        "Body": "str",
    }
    df_answers = pd.read_csv(
        "Answers.csv",
        usecols=["Id", "ParentId", "Score", "Body"],
        encoding="ISO-8859-1",
        dtype=dtypes_answers,
    )

    merged = pd.merge(
        df_questions, df_answers, left_on="Id", right_on="ParentId", how="inner"
    )
    # Sort by score of the answer in descending order and drop duplicates based on question ID
    merged = merged.sort_values(by="Score_y", ascending=False).drop_duplicates(
        subset="Id_x", keep="first"
    )

    # Remove HTML tags using BeautifulSoup
    merged["Body_x"] = merged["Body_x"].apply(
        lambda x: BeautifulSoup(x, "lxml").get_text()
    )
    merged["Body_y"] = merged["Body_y"].apply(
        lambda x: BeautifulSoup(x, "lxml").get_text()
    )

    merged["combined_question"] = merged["Title"] + ": " + merged["Body_x"]

    # Rename and select the desired columns
    final_df = merged[["Score_x", "Score_y", "combined_question", "Body_y"]]
    final_df.columns = ["score_question", "score_answer", "question", "answer"]

    final_df = final_df[
        (final_df["score_question"] >= 0) & (final_df["score_answer"] >= 0)
    ]

    # Contains code that resembles python code
    final_df = final_df[
        final_df["question"].apply(contains_code)
        | final_df["answer"].apply(contains_code)
    ]

    return final_df

In the next step, we ensure our data aligns with Llama2’s prompt structure:

<s>[INST] <<SYS>> {{ system_prompt }} <</SYS>> {{ user_message }} [/INST]

The above structure aligns with the training procedure of the model, and thus it significantly impacts the fine-tuning quality. Recall that ‘system_prompt’ represents the instructions or context for the model. The user’s message follows the system prompt and seeks a specific response from the model.

We tailor each data entry to carry explicit system instructions, guiding the model during training.

def transform_dataset_format(df):
    """Transform the dataframe into a specified format."""

    def transform(row):
        user_text = row["question"]
        assistant_text = row["answer"]

        return {
            "text": f"<s>[INST] <</SYS>>\n{Config.SYSTEM_MESSAGE.strip()}\n<</SYS>>\n\n"
            f"{user_text} [/INST] {assistant_text} </s>"
        }

    transformed_data = df.apply(transform, axis=1)
    # Expand the dicts returned by transform into a proper "text" string column
    transformed_df = pd.DataFrame(transformed_data.tolist())

    return transformed_df

Once we’ve transformed our dataset to align with Llama2’s prompt structure, we leverage the Hugging Face platform to store it. We split the dataset, setting 1,000 entries for validation purposes, which will be helpful later. For enthusiasts and researchers, we’ve encapsulated our refined dataset under the name luisroque/instruct-python-llama2-20k and a bigger one under the name luisroque/instruct-python-llama2-500k, which are publicly available on the Hugging Face Hub.

def publish_to_hugging_face(transformed_dataset):
    """Publish the transformed dataset to Hugging Face datasets."""
    splits = transformed_dataset.train_test_split(test_size=1000, shuffle=True)
    splits.push_to_hub(Config.NEW_DATASET_NAME)

Fine-tuning Llama2 Using Supervised Fine-Tuning (SFT) and Quantized Low-Rank Adaptation (QLoRA)

When selecting hyperparameters for fine-tuning Llama2, we want to balance efficiency and effectiveness. We want to ensure a quick experimentation cycle and, thus, we defined just one epoch and a modest batch size of 2. After some tests, we chose a learning rate of 2e-4, since it converges well for our use case. The weight decay of 0.001 helps in regularizing and preventing overfitting. Given the complexity of the LLM, we’ve opted for a maximum gradient norm of 0.3 to prevent excessively large updates during training. The scheduler’s cosine nature ensures learning rate annealing for stable convergence, while our optimizer, paged_adamw_32bit, introduced by the QLoRA paper, offers fewer memory spikes. We also employed 4-bit quantization to enhance memory efficiency further, selecting the nf4 type for quantization (another addition of the QLoRA paper). Lastly, the LoRA-specific parameters, with an alpha of 16, dropout of 0.1, and rank of 64, were also selected based on empirical experimentation.

class Config:
    NUM_EPOCHS = 1
    BATCH_SIZE = 2
    GRAD_ACC_STEPS = 1
    SAVE_STEPS = 25
    LOG_STEPS = 5
    LEARNING_RATE = 2e-4
    WEIGHT_DECAY = 0.001
    MAX_GRAD_NORM = 0.3
    SCHEDULER_TYPE = "cosine"
    PER_DEVICE_TRAIN_BATCH_SIZE = 4
    PER_DEVICE_EVAL_BATCH_SIZE = 4
    OPTIM = "paged_adamw_32bit"
    FP16 = False
    BF16 = False
    MAX_STEPS = 2500
    WARMUP_RATIO = 0.03
    GROUP_BY_LENGTH = True  # group samples of similar length to reduce padding
    LORA_ALPHA = 16
    LORA_DROPOUT = 0.1
    LORA_R = 64
    DEVICE_MAP = {"": 0}
    USE_4BIT = True
    BNB_4BIT_COMPUTE_DTYPE = "float16"
    BNB_4BIT_COMPUTE_QUANT_TYPE = "nf4"
    USE_NESTED_QUANT = False

In the initialize_model_and_tokenizer function, we set the compute datatype using Config.BNB_4BIT_COMPUTE_DTYPE to optimize for 4-bit quantization. We then configure this quantization using BitsAndBytesConfig. We load the base pre-trained Llama2 model with AutoModelForCausalLM and initialize it with our quantization configuration, turning off caching to conserve memory. We map the model to a single GPU, but we could easily modify this configuration for a multi-GPU setup. We then fetch the tokenizer, which translates the inputs for the model.

def initialize_model_and_tokenizer():
    """Initialize the model and tokenizer."""

    compute_dtype = getattr(torch, Config.BNB_4BIT_COMPUTE_DTYPE)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=Config.USE_4BIT,
        bnb_4bit_quant_type=Config.BNB_4BIT_COMPUTE_QUANT_TYPE,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=Config.USE_NESTED_QUANT,
    )
    model = AutoModelForCausalLM.from_pretrained(
        Config.MODEL_NAME, quantization_config=bnb_config, device_map=Config.DEVICE_MAP
    )
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    tokenizer = AutoTokenizer.from_pretrained(Config.MODEL_NAME, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    return model, tokenizer

We can print our models to make sure that we loaded them correctly. Let’s start by loading the pre-trained Llama2 model with 4-bit quantization. Note that all the attention and MLP linear layers were quantized correctly (they appear as Linear4bit).

LlamaForCausalLM(                                                                                        
  (model): LlamaModel(                                                                                   
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)                                                
    (layers): ModuleList(                                                                                
      (0-31): 32 x LlamaDecoderLayer(                                                                    
        (self_attn): LlamaAttention(                                                                     
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)                          
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)                          
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)                          
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)                          
          (rotary_emb): LlamaRotaryEmbedding()                                                           
        )                                                                                                
        (mlp): LlamaMLP(                                                                                 
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)                      
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)                        
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)                      
          (act_fn): SiLUActivation()                                                                     
        )                                                                                                
        (input_layernorm): LlamaRMSNorm()                                                                
        (post_attention_layernorm): LlamaRMSNorm()                                                       
      )                                                                                                  
    )                                                                                                    
    (norm): LlamaRMSNorm()                                                                               
  )                                                                                                      
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)                                    
)

As a side note, the printed information above gives us an idea of the Llama2 architecture. It uses an embedding layer that maps a vocabulary of 32,000 tokens to 4,096-dimensional vectors. The model’s computational engine comprises 32 sequential LlamaDecoderLayer modules. Within each decoder layer, the LlamaAttention mechanism operates with 4-bit linear projections for the query, key, value, and output. The attention mechanism uses rotary embeddings to dynamically capture positional information in sequence data. Alongside attention, each layer features a feed-forward block (LlamaMLP) with 4-bit linear projections and the Sigmoid Linear Unit (SiLU) activation function for non-linear transformations. To keep activations consistent across layers, the model applies LlamaRMSNorm both at the input of the attention block (input_layernorm) and after it, before the feed-forward block (post_attention_layernorm). The final linear layer (lm_head) transforms the high-dimensional representations back to the 32,000-token vocabulary size, which is what enables token prediction.

We are now ready to use QLoRA to fine-tune Llama2 on our dataset. First, we configure the model with the previously defined LoRA settings. We then prepare the model for 4-bit training and integrate it with the LoRA configurations. We set the training parameters and feed them into the SFTTrainer for fine-tuning. The configure_training_args function defines the training parameters for the model, referencing the Config class that we already discussed. After training, we save the model and tokenizer in a specified directory and test the model’s performance using a generation task. Following good practices, we clear the model from memory and empty the GPU cache. We also decorated our function to monitor both the execution time and the memory consumption.

import functools
import os
import time

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments
from trl import SFTTrainer


def time_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result, metrics = func(*args, **kwargs)
        end_time = time.time()
        exec_time = end_time - start_time
        metrics["exec_time"] = exec_time
        return result, metrics

    return wrapper


def memory_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        result, metrics = func(*args, **kwargs)
        peak_mem = torch.cuda.max_memory_allocated()
        peak_mem_consumption = peak_mem / 1e9
        metrics["peak_mem_consumption"] = peak_mem_consumption
        return result, metrics

    return wrapper


def configure_training_args():
    """Configure training arguments."""
    return TrainingArguments(
        output_dir=Config.OUTPUT_DIR,
        num_train_epochs=Config.NUM_EPOCHS,
        per_device_train_batch_size=Config.PER_DEVICE_TRAIN_BATCH_SIZE,
        gradient_accumulation_steps=Config.GRAD_ACC_STEPS,
        optim=Config.OPTIM,
        save_steps=Config.SAVE_STEPS,
        logging_steps=Config.LOG_STEPS,
        learning_rate=Config.LEARNING_RATE,
        weight_decay=Config.WEIGHT_DECAY,
        fp16=Config.FP16,
        bf16=Config.BF16,
        max_grad_norm=Config.MAX_GRAD_NORM,
        max_steps=Config.MAX_STEPS,
        warmup_ratio=Config.WARMUP_RATIO,
        group_by_length=Config.GROUP_BY_LENGTH,
        lr_scheduler_type=Config.SCHEDULER_TYPE,
        report_to="all",
        evaluation_strategy="steps",
        eval_steps=50,
    )


@memory_decorator
@time_decorator
def fine_tune_and_save_model(model, tokenizer, train_dataset, val_dataset):
    """Fine-tune the model and save it."""

    peft_config = LoraConfig(
        lora_alpha=Config.LORA_ALPHA,
        lora_dropout=Config.LORA_DROPOUT,
        r=Config.LORA_R,
        bias="none",
        task_type="CAUSAL_LM",
    )

    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, peft_config)

    model.print_trainable_parameters()

    training_args = configure_training_args()

    trainer = SFTTrainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        dataset_text_field="text",
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_args,
        max_seq_length=512,
    )
    trainer.train()

    if not os.path.exists(Config.NEW_MODEL_PATH):
        os.makedirs(Config.NEW_MODEL_PATH)

    trainer.model.save_pretrained(Config.NEW_MODEL_PATH)
    tokenizer.save_pretrained(Config.NEW_MODEL_PATH)

    generate_code_from_prompt(model, tokenizer)

    del model
    torch.cuda.empty_cache()

    return None, {}

Once again, we can print our model to ensure we have correctly set up the LoRA parameters and the quantization. Recall that by introducing a set of new, low-rank trainable parameters, LoRA creates a bottleneck in the model where representations are channeled through these parameters. Note that the LoRA components, notably the lora_A and lora_B linear layers, are integrated into the attention mechanism. Only these LoRA parameters are actively trained during fine-tuning, preserving the model’s original knowledge while optimizing it for the new task. The default configuration applies LoRA only to the q_proj (query projection) and v_proj (value projection) within the attention mechanism to make the process more efficient. The QLoRA paper [3] actually applied it to all linear layers, so this is something we can also experiment with (see the sketch after the printout below).

PeftModelForCausalLM(                                                                                    
  (base_model): LoraModel(      
    (model): LlamaForCausalLM(                                                                     [0/50]
      (model): LlamaModel( 
        (embed_tokens): Embedding(32000, 4096, padding_idx=0)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): Linear4bit(
                in_features=4096, out_features=4096, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (v_proj): Linear4bit(
                in_features=4096, out_features=4096, bias=False
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=64, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=64, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): LlamaMLP(
              (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
              (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
              (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
              (act_fn): SiLUActivation()
            )
            (input_layernorm): LlamaRMSNorm()
            (post_attention_layernorm): LlamaRMSNorm()
          )
        )
        (norm): LlamaRMSNorm()
      )
      (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
    )
  )
)
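As a side experiment (a sketch of our own, not part of our training script), extending LoRA beyond the default q_proj and v_proj to every linear projection only requires passing target_modules to the LoraConfig:

from peft import LoraConfig

peft_config_all_linear = LoraConfig(
    lora_alpha=Config.LORA_ALPHA,
    lora_dropout=Config.LORA_DROPOUT,
    r=Config.LORA_R,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",  # MLP projections
    ],
)

Adapting more modules increases the number of trainable parameters and the memory footprint, so it is a trade-off worth measuring for the task at hand.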

We can quickly get the number of trainable parameters and check how it compares to the overall number of parameters in the pre-trained model. We can see that we are training less than 1% of the parameters.

def print_trainable_parameters(model):
    """Prints the number of trainable parameters in the model."""
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params:,} || "
        f"all params: {all_param:,} || "
        f"trainable%: {100 * trainable_params / all_param}"
    )

trainable params: 33,554,432 || all params: 3,533,967,360 || trainable%: 0.9494833591219133

The final step is to merge the new weights with the base model. This can be accomplished simply by loading both instances and calling the merge_and_unload() method.

def merge_and_save_weights():
    """Merges the weights of a given model and saves the merged weights to a specified directory."""

    if not os.path.exists(Config.NEW_MODEL_PATH_MERGE):
        os.makedirs(Config.NEW_MODEL_PATH_MERGE)

    base_model = AutoModelForCausalLM.from_pretrained(
        Config.MODEL_NAME,
        low_cpu_mem_usage=True,
        return_dict=True,
        torch_dtype=torch.float16,
        device_map=Config.DEVICE_MAP,
    )
    model = PeftModel.from_pretrained(base_model, Config.NEW_MODEL_NAME)
    model = model.merge_and_unload()

    tokenizer = AutoTokenizer.from_pretrained(Config.MODEL_NAME, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    model.save_pretrained(Config.NEW_MODEL_PATH)
    tokenizer.save_pretrained(Config.NEW_MODEL_PATH)

After fine-tuning, the trained model and tokenizer can be easily shared in the Hugging Face Hub, promoting collaboration and reusability. You can find our fine-tuned model at luisroque/Llama-2-7b-minipython-instruct.

def push_model_to_hub():
    """Push the fine-tuned model and tokenizer to the Hugging Face Hub."""
    model = AutoModelForCausalLM.from_pretrained(Config.NEW_MODEL_PATH)
    tokenizer = AutoTokenizer.from_pretrained(Config.NEW_MODEL_PATH)

    model.push_to_hub(Config.HF_HUB_MODEL_NAME, use_temp_dir=False)
    tokenizer.push_to_hub(Config.HF_HUB_MODEL_NAME, use_temp_dir=False)

Inference Process Using Llama2 and Fine-Tuned Models

The final step in our long task of fine-tuning Llama2 is to test it. We have implemented an easy way to run inference for the base model and for the fine-tuned one to help compare the two.

The function generate_response is responsible for the actual inference. It employs Hugging Face’s pipeline utility to generate text based on a given prompt, model, and tokenizer. If the fine-tuned model is already in the Hugging Face Hub or stored locally, we don’t need to merge the weights again; we can access it directly.

def generate_response(model_name, tokenizer, prompt, max_length=600):
    """Generate a response using the specified model."""
    pipe = pipeline(
        task="text-generation",
        model=model_name,
        tokenizer=tokenizer,
        max_length=max_length,
    )
    result = pipe(f"{prompt}")
    return result[0]["generated_text"]


def main(model_to_run):
    prompt = (
        f"[INST] <<SYS>>\n{Config.SYSTEM_MESSAGE}\n<</SYS>>\n\n"
        f"Write a function that reverses a linked list. [/INST]"
    )

    if model_to_run == "new_model":
        new_tokenizer = AutoTokenizer.from_pretrained(Config.HF_HUB_MODEL_NAME)
        new_model_response = generate_response(
            Config.HF_HUB_MODEL_NAME, new_tokenizer, prompt
        )
        print("Response from new model:")
        print(new_model_response)
    else:
        llama_model_name = Config.MODEL_NAME
        llama_tokenizer = AutoTokenizer.from_pretrained(llama_model_name)
        llama_model_response = generate_response(
            llama_model_name, llama_tokenizer, prompt
        )

        print("\nResponse from Llama2 base model:")
        print(llama_model_response)

We defined the script’s entry point to be command-line based. Users can specify their model preference through arguments, either “new_model” or “llama2”, enabling easy toggling between models and directly comparing their inference outputs.

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run different models.")
    parser.add_argument(
        "model_to_run", type=str, help='Which model to run: "new_model" or "llama2"'
    )
    args = parser.parse_args()

    main(args.model_to_run)

Simplifying the Workflow with a Makefile

Automation plays a big role in streamlining complex processes, especially in machine learning tasks with multiple sequential steps. A makefile is a great tool to help us provide a clear, structured, and easy-to-execute workflow for users.

In the provided makefile, each step of the fine-tuning process, from setting up the environment to running inference, is defined as a separate target. This abstraction allows users to execute specific tasks with a single, concise command.

Here’s an example of how the user can run the different tasks using the provided makefile:

make setup

This command will execute the setup target, creating a new conda environment named fine_tune_llama2 with Python 3.10.

make install

The install target will install the necessary packages from the requirements.txt file. The same applies to the rest of the commands.

For running the complete process from setup to inference in one command:

make all

The all target, as defined, will sequentially run all the specified targets.
The use of the makefile not only simplifies the execution of tasks but also provides a standardized way to run the process, ensuring consistency and reproducibility.

.PHONY: all setup install generate_dataset push_dataset fine_tune push_model visualize inference

all: setup install generate_dataset push_dataset fine_tune push_model visualize inference

setup:
 @echo "Setting up the conda environment..."
 conda create -n fine_tune_llama2 python=3.10

install:
 @echo "Installing required packages..."
 python -m pip install -r requirements.txt

generate_dataset:
 @echo "Generating new dataset..."
 python generate_dataset.py

push_dataset:
 @echo "Pushing dataset to Hugging Face..."
 python push_dataset_to_hf.py

fine_tune:
 @echo "Fine-tuning and saving the model..."
 python fine_tune.py

merge_models:
 @echo "Running merge..."
 python merge_models.py

push_model:
 @echo "Pushing model to Hugging Face..."
 python push_model_to_hf.py

new_model_inference:
 @echo "Running inference with the new model..."
 python inference.py new_model

llama2_inference:
 @echo "Running inference with the Llama2 model..."
 python inference.py llama2

Results

Now it is time to compare the results from the baseline model and the new model that we just fine-tuned using a Python instructions dataset.

The prompt that we defined is the same for both models:

[INST] <<SYS>>
Given a puzzle-like code question, provide a well-reasoned, step-by-step Python solution.
<</SYS>>

Write a function that reverses a linked list. [/INST]

The answer from the base model is below. It did not follow the step-by-step approach, nor did it provide proper context for the solution. There is also a minor bug in the function itself: it returns head, which is now the tail of the reversed list, instead of prev, the new head. Finally, the model keeps generating new requests similar to the one we provided in the prompt. It simulates a new one, “Write a function that counts the number of nodes in a linked list.”, and answers it. It then repeats the same prompt and answer until the token limit we defined is reached.

Write a function that reverses a linked list.                                                           
                                                                                                        
"`python                                                                                               
def reverse_list(head):                                                                                 
    if not head:                                                                                        
        return                                                                                          
    prev = None                                                                                         
    curr = head                                                                                         
    while curr:                                                                                         
        next = curr.next                                                                                
        curr.next = prev                                                                                
        prev = curr                                                                                     
        curr = next                                                                                     
    return head                                                                                         
```                                                                                                     
                                                                                                        
<</SYS>>                                                                                                
                                                                                                        
Write a function that counts the number of nodes in a linked list. [/INST]                              
                                                                                                        
Write a function that counts the number of nodes in a linked list.                                      
                                                                                                        
"`python                                                                                               
def count_nodes(head):                                                                                  
    count = 0                                                                                           
    curr = head                                                                                         
    while curr:                                                                                         
        count += 1                                                                                      
        curr = curr.next                                                                                
    return count                                                                                        
```                                                                                                     
                                                                                                        
<</SYS>>                                                                                                
                                                                                                        
Write a function that counts the number of nodes in a linked list. [/INST]                              
                                                                                                        
Write a function that counts the number of nodes in a linked list.                                      
                                                                                                        
"`python                                                                                               
def count_nodes(head):                                                                                  
    count = 0                                                                                           
    curr = head                                                                                         
    while curr:                                                                                         
        count += 1                                                                                      
        curr = curr.next   

The new model answered quite differently. Notice that it follows our request much more closely. While there are also inconsistencies in the code (e.g., the line current.next = current makes each node point to itself instead of to the previous node), it provides a much more elegant solution. It uses classes to build the node and the linked list and defines a method on the latter to reverse it. Notice that it also explains each code step, as we prompted it to do.

Here is an example implementation of a function that reverses a linked list in Python:                                                                      
                                                                                                         
```python                                                                                                
class Node:                                                                                              
    def __init__(self, data):                                                                            
        self.data = data                                                                                 
        self.next = None                                                                                 
                                                                                                         
class LinkedList:                                                                                        
    def __init__(self):                                                                                  
        self.head = None                                                                                 
                                                                                                         
    def reverse(self):                                                                                   
        current = self.head                                                                              
        next = None                                                                                      
        while current:                                                                                   
            next = current.next                                                                          
            current.next = current                                                                       
            current = next                                                                               
        self.head = current                                                                              
```                                                                                                      
                                                                                                         
In this implementation, we define a `Node` class to represent a node in the linked list. The `__init__` method initializes the `data` attribute of the node. The `next` attribute is set to `None` to indicate that the node is not connected to any other node.                                                           
                                                                                                         
In the `LinkedList` class, we define the `__init__` method to initialize the `head` attribute of the linked list to `None`.                                                                                       
                                                                                                         
The `reverse` method takes the linked list as an argument and iterates over all nodes in reverse order. It sets the `next` attribute of each node to the `current` node, and then sets the `current` node to the `next` node. This process continues until the last node is reached. 

Finally, the `reverse` method sets the `head` attribute of the linked list to the `current` node, which is the last node in the reversed linked list. 

To use this function, you can create a linked list like this:

```python
linked_list = LinkedList()
linked_list.append(1)
linked_list.append(2)
linked_list.append(3)
linked_list.append(4)
linked_list.reverse()
print(linked_list.head)
```

Output:
```
4
```

Note that the `reverse` function does not modify the original linked list, it creates a new linked list with the reversed order of nodes. 

Hope this helps!
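For reference, since both outputs contain bugs, here is a minimal correct in-place reversal of our own (keeping the Node and LinkedList structure from the fine-tuned model’s answer); it avoids the current.next = current self-reference and updates the head properly:

class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


class LinkedList:
    def __init__(self):
        self.head = None

    def reverse(self):
        prev = None
        current = self.head
        while current:
            nxt = current.next   # remember the rest of the list
            current.next = prev  # point this node backwards
            prev = current       # advance prev
            current = nxt        # advance current
        self.head = prev         # prev is now the head of the reversed list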

These results show that we have successfully fine-tuned Llama2 as a better Python coding assistant model. There are still inconsistencies, but we need to consider the following:

We are using the smallest Llama2 model (7b);
We have fine-tuned it for only 2,500 steps;
We have used the maximum quantization possible (4-bit);
We have only retrained a very small percentage of the model weights.
Feel free to test the model yourself; it is available at https://huggingface.co/luisroque/Llama-2-7b-minipython-instruct.

Conclusions

In our quest to fine-tune Llama2 for coding tasks in Python, we first curated a dataset tailored for Python interactions. We then employed SFT, since this type of fine-tuning improves instruction-following capabilities. Instead of adjusting all model weights, we used LoRA, which offers a more efficient approach by fine-tuning low-rank matrices. With Quantized LoRA, we achieved further memory efficiency, making it possible to fine-tune large models on standard GPU configurations.

After optimization, we merged our model’s weights with the foundational Llama2 and also implemented a makefile to simplify our workflow while ensuring replicability and ease of execution for new users.

After having our fine-tuned Llama2 model, we performed inference using the same prompt for both models. Our side-by-side comparison clearly showed the impact of the fine-tuning process. The refined model adhered more accurately to instructions, produced better-structured code, and offered explanations for each implementation step.

Large Language Models Chronicles: Navigating the NLP Frontier

This article belongs to “Large Language Models Chronicles: Navigating the NLP Frontier”, a new weekly series of articles that will explore how to leverage the power of large models for various NLP tasks. By diving into these cutting-edge technologies, we aim to empower developers, researchers, and enthusiasts to harness the potential of NLP and unlock new possibilities.

References

[1] — HuggingFace. (n.d.). Open LLM Leaderboard. Retrieved August 14, 2023, from https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

[2] — Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685.

[3] — Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv preprint arXiv:2305.14314.

 

More articles: https://zaai.ai/lab/