A De Facto Guide to Building Generative AI Apps with the Google PaLM API
Issue #14 | Deep dive on how to access/call the PaLM API (MakerSuite, Vertex AI client libraries, REST API), implementing a useful task (structured data extraction), and developer notes.
Generative AI models such as large language models (LLMs) offer developers an opportunity to build new experiences and deliver value to end users. Tools like ChatGPT, powered by the GPT-3.5 and GPT-4 models from OpenAI, have demonstrated the capabilities of these models.
Similar to the GPT models, PaLM is a transformer-based foundation model offered by Google as an API service. As a developer, understanding the capabilities of LLMs from multiple providers (e.g., OpenAI, Google, Anthropic, Cohere) can be valuable in making software design decisions (model selection, effort estimation, limitations, etc.). In this post, I’ll dig into what I’ve learned while exploring the PaLM API, covering the following:
TL;DR
Model Overview: Overview of the PaLM model architecture (a transformer-based model trained with a mixture of language modeling objectives and extensive compute).
API Interfaces: Pros/cons of different approaches to calling the PaLM API (MakerSuite vs. Vertex AI client libraries vs. Vertex AI REST API).
Use Case Implementation: Implementation and performance on a concrete/useful task - structured data extraction. We’ll use PaLM to analyze multiple book summaries (from the CMU Book Summary dataset), extract a list of characters, their actions, and their relevance to a given user profile, and plot these stats to extract insights.
Developer Notes: Notes specific to the PaLM API, e.g., the API provides valuable citations for some responses, responses may be blocked due to safety filters, low-level prompting requirements, instruction-following capabilities, etc.
Note: This post focuses on text generation models fine-tuned for multi-turn conversation (chat) applications. It does not cover embedding models, multimodal models, etc.
Have you tried the PaLM model? Share your thoughts and insights in the comments!
The PaLM Model
PaLM (Pathways Language Model) is a family of transformer-based models trained using Google’s Pathways system (aimed at scaling training efficiently). Google published an initial version of the PaLM model architecture in this paper (540B parameters, decoder-only transformer architecture) and later a subsequent technical paper (with limited details) on an updated version of the model - PaLM 2.
The PaLM 2 technical paper indicates a focus on reduced model size, compute-optimal scaling (a smaller model trained on more data), improved dataset curation, and a mixture of language modeling objectives (similar to UL2).
I expect the exact size, training data, and language modeling objectives to evolve over time as the service matures, as noted in the PaLM 2 paper.
When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time.
PaLM Model Use Cases, Size, Parameters
PaLM models span a fairly wide range of use cases, including LLMs for text generation, code generation, chat-finetuned models (models fine-tuned to work in conversational multi-turn chat settings), text embeddings, and models with multimodal capabilities. These models are also categorized by size (code-named gecko for smaller models and bison for larger models), with cost implications.
Given how fast the field changes, the reader is encouraged to view the official pages on available models and Vertex AI pricing for up-to-date information.
Similar to most LLMs, PaLM models support several parameters, the most important being:
temperature: controls the degree of randomness in the generated response.
maxOutputTokens: the maximum number of tokens that can be generated in the response.
candidateCount: the number of response variations to return.
messages: a list of objects containing author and content fields that serve as prompts. This is specific to chat-finetuned models.
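As a quick illustration, these parameters might be assembled as follows for a single chat turn (the names here follow the REST API; the Python SDKs use snake_case equivalents such as max_output_tokens, and the values shown are arbitrary):

parameters = {
    "temperature": 0.2,       # 0 = deterministic; higher values = more random
    "maxOutputTokens": 256,   # cap on the length of the generated response
    "candidateCount": 1,      # number of response variations to return
}
messages = [
    {"author": "user", "content": "Summarize this book in one sentence."},
]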
Accessing The PaLM API - MakerSuite vs Vertex Client Libraries vs Vertex REST API
Now that we have some technical details out of the way, let’s explore methods to access and build with the API. As at the time of writing, there are three main ways to access the PaLM API: MakerSuite, the Vertex AI client libraries, and the Vertex AI REST API. While this can be confusing, each approach has pros and cons that can inform when to use which.
MakerSuite
✅ MakerSuite focuses on being developer friendly and provides a simplified API for accessing generative AI models. It follows the approach of similar services such as OpenAI, Cohere, and Anthropic, and requires an API key (which you can get by joining a waitlist).
⚠️ As at the time of writing, only a subset of the available PaLM models can be accessed via MakerSuite. This makes it great for prototyping, exploration, and research, but limited for production workflows across multiple tasks.
!pip install google-generativeai

import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # MakerSuite API key (placeholder)

defaults = {
    'model': 'models/text-bison-001',
    'temperature': 0.7,
    'candidate_count': 1,
    'max_output_tokens': 1024,
}
prompt = "Write a function to calculate the min of two numbers"
response = palm.generate_text(**defaults, prompt=prompt)
print(response.result)
Vertex AI Client Libraries
If you have used any of the machine learning offerings on GCP, you might be familiar with Vertex AI, which aims to be a unified platform for all things machine learning. While the current branding (understandably) focuses on generative AI, the Vertex AI product historically spans (and still spans) much more - tools to ingest data, label data, train custom models, serve models, etc.
✅ As part of its toolchain, Vertex AI has client libraries in multiple languages (Python, Node.js, Java), and there is currently support for generative AI models. This approach is recommended as it provides access to all available models and all the benefits of a well-integrated GCP product: a familiar authentication flow (e.g., using application default credentials or service account keys), the production guarantees of GCP services, granular association of services with billing accounts and IAM, etc.
⚠️ As at the time of writing, the client libraries are in preview mode, so expect changes. Also, I found that the client APIs may implement paradigms that are not fully intuitive (e.g., the send_message method in the Python API) and provide limited visibility into API behavior. Finally, the client libraries may have additional install dependencies which may not be necessary for your use case (assuming installation time matters to you).
# pip install google-cloud-aiplatform
import vertexai
from vertexai.preview.language_models import CodeChatModel

# Initialize with your project and region (uses application default credentials)
vertexai.init(project="YOUR_PROJECT_ID", location="us-central1")
parameters = { "temperature": 0.5, "max_output_tokens": 1024}
code_chat_model = CodeChatModel.from_pretrained("codechat-bison@001")
chat = code_chat_model.start_chat()
response = chat.send_message("Write a function to calculate the min of two numbers", **parameters)
print(f"Response from Model: {response.text}")
Vertex AI REST API
Similar to all GCP services, you can make REST API calls to the Vertex AI LLM service. The main thing you need to take care of is handling authentication correctly.
✅ This approach provides a good lens into API behavior, allows flexibility in development work, and requires very few dependencies beyond what is needed for authentication.
⚠️ You will have to implement things like error handling and authentication yourself, which are somewhat simplified by the client libraries.
POST https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/us-central1/publishers/google/models/codechat-bison:predict
{
  "instances": [
    {
      "messages": [
        {
          "content": string,
          "author": string
        }
      ]
    }
  ],
  "parameters": {
    "temperature": TEMPERATURE,
    "maxOutputTokens": MAX_OUTPUT_TOKENS,
    "candidateCount": CANDIDATE_COUNT
  }
}
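As a minimal sketch (not the only way to do this), the request above can be sent from Python using the google-auth and requests libraries; the project ID and prompt below are placeholders, and the response body contains a predictions list:

import google.auth
import google.auth.transport.requests
import requests

PROJECT_ID = "YOUR_PROJECT_ID"  # placeholder
MODEL_ID = "codechat-bison"
URL = (
    "https://us-central1-aiplatform.googleapis.com/v1/projects/"
    f"{PROJECT_ID}/locations/us-central1/publishers/google/models/{MODEL_ID}:predict"
)

# Obtain an access token from application default credentials
credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(google.auth.transport.requests.Request())

payload = {
    "instances": [
        {"messages": [{"author": "user", "content": "Write a function to calculate the min of two numbers"}]}
    ],
    "parameters": {"temperature": 0.2, "maxOutputTokens": 1024, "candidateCount": 1},
}

response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {credentials.token}"},
    json=payload,
    timeout=60,
)
print(response.json()["predictions"][0])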
Overall, I found this approach to be the best experience - e.g., it allowed access to the entire list of available models and flexibility in building a chat-based experience, while taking on minimal dependencies from installing client libraries.
llmx - A Library for Chat-Finetuned LLMs
Since I have been exploring the PaLM API for multiple use cases, I have written an OSS (MIT license) REST API wrapper that can be reused across projects - llmx. llmx provides a simple, unified interface for chat-finetuned models, with support for PaLM models, Cohere, OpenAI, and OSS models via the transformers library. Sample code using llmx is shown below.
#pip install llmx
from llmx import llm, TextGenerationConfig
palm_gen = llm(provider="palm")  # assumes GCP credentials for the PaLM API are already set up
palm_config = TextGenerationConfig(model="codechat-bison", temperature=0, max_tokens=50, use_cache=True)
messages = [{"role": "system", "content": "You are a helpful assistant that can explain concepts clearly to a 6 year old child."},
{"role": "user", "content": "What is gravity?"}]
palm_response = palm_gen.generate(messages, config=palm_config)
print(palm_response.text[0].content)
A Structured Data Extraction Use Case
For the purpose of this post, we will define structured data extraction as follows:
Structured Data Extraction.
Given some semi-structured or unstructured data (text), extract entities into a structured format (e.g., a JSON file, table or database).
This general task is interesting as it applies to practical business domains e.g.,
Hiring: Improve candidate selection by quickly identifying relevant skills, experience, and qualifications.
Legal: Legal firms and businesses can extract and analyze key data points from contracts, such as dates, terms, clauses, and parties involved, to identify potential legal risks, streamline negotiations, and improve overall contract management.
Customer Support: Automating the extraction of structured data from customer support inquiries can help identify common issues, route queries to the appropriate support agents, and improve overall support efficiency and customer satisfaction.
We will explore this task using a subset of the CMU Book Summary dataset. Each row in the dataset has book name, genre, and summary (between 500 and 5,000 characters) columns. Our goal is to extract the list of characters in each summary - their names, actions, and gender - and their relevance given a user’s profile.
The overall implementation process is summarized as follows:
Construct a random sample of the dataset (in the results below I use n=100)
For each summary, prompt PaLM (chat-bison) to return a JSON data structure containing structured data (see prompt snippet below).
Parse the structured data and assemble it into a data frame.
Post-process the data frame and plot the results.
The code snippet below is an example of the prompt used in the task.
# system prompt used for structured data extraction.
system_prompt="""
You are a nice and friendly but highly skilled book analyst that can take a user's interests/profile into consideration and help write a review of a book summary to help the user. Your review should assess if the book is a match with the user's interest and why, characters in the book and their gender and main actions. You should think step by step and do your best. The goal is to help the user.
Your RESPONSE MUST BE PERFECT JSON and MUST FOLLOW THE RULES OF JSON OBJECTS (e.g., keys must be strings, VALUES MUST NOT CONTAIN DOUBLE QUOTES etc.). YOUR RESPONSE MUST BE IN THE FORMAT BELOW.
{"match": "yes","match_reason": "The book is a match because it is a science fiction book and the user likes science fiction books","characters" : [{ "name": "John", "actions": ["gender","male","John went to the market", "John bought a car"]}, {"gender","female","name": "Mary", "actions": ["Mary went to the market", "Mary bought a car"]} ...}
"""
user_profile="I enjoy crime novels .."
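A minimal sketch of how each book summary can be sent to the model with this system prompt, using the llmx wrapper introduced above (the exact formatting of the user message and the variable names here are my own assumptions, not the verbatim implementation):

from llmx import llm, TextGenerationConfig

palm_gen = llm(provider="palm")
config = TextGenerationConfig(model="chat-bison", temperature=0, max_tokens=1024)

summary = "..."  # one summary from the CMU Book Summary dataset sample
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"USER PROFILE: {user_profile}\n\nBOOK SUMMARY: {summary}"},
]
response = palm_gen.generate(messages, config=config)
raw_text = response.text[0].content  # expected to be a JSON string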
Example output text generated by PaLM is shown below:
{'match': 'yes',
'match_reason': 'The book is a match because it is a crime novel and the user likes crime novels',
'characters': [{'name': 'Harry Hole',
'gender': 'male',
'actions': ['Harry went to the market',
'Harry bought a car',
'Harry investigated a crime']},
{'name': 'Rakel',
'gender': 'female',
'actions': ['Rakel met Harry',
'Rakel talked to Harry',
'Rakel fell in love with Harry']},
...
{'name': 'Crown Prince of Norway',
'gender': 'male',
'actions': ['The Crown Prince of Norway was the target of an assassination attempt',
'The Crown Prince of Norway was saved by Harry',
"The Crown Prince of Norway's identity was revealed"]}]}
Now that the model returns structured text, we can parse it as JSON and plot the results to extract insights.
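As a minimal sketch of the parsing and aggregation step, assuming the raw model responses have been collected in a list of (genre, raw_text) pairs (the variable names are placeholders):

import json
import pandas as pd

records = []
for genre, raw_text in results:  # raw responses collected in the extraction loop
    data = json.loads(raw_text)
    for character in data.get("characters", []):
        records.append({
            "genre": genre,
            "name": character.get("name"),
            "gender": character.get("gender"),
            "num_actions": len(character.get("actions", [])),
            "match": data.get("match"),
        })

df = pd.DataFrame(records)
# e.g., number of extracted characters per genre, and gender distribution per genre
print(df.groupby("genre")["name"].count())
print(df.groupby(["genre", "gender"]).size().unstack(fill_value=0))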
Plots of extracted data are shown below:
Two concrete insights we can derive from reviewing results from the model include:
On average, the Fantasy and Horror genres have more characters listed in their summaries compared to other genres.
Science Fiction and Historical Novels have a much higher ratio of male to female characters compared to Fantasy and Horror, which appear to be more balanced. Interesting! Unsurprisingly, we find a larger percentage of characters with unknown gender in Fantasy novels!
Typically, a task like structured data extraction would require assembling labels and training multiple task-specific models (e.g., gender prediction models, entity extraction models, task extraction models, etc.). Large language models can create business value by simplifying the implementation of such tasks (a single model, no need for expensive task-specific data collection), and at a higher level of accuracy.
Model Performance - Block Rate and Task Performance
We find that about 9% of requests are blocked based on safety criteria. The Thriller and Crime Fiction genres tend to trigger the safety blocks for the PaLM API at a higher percentage. Not much surprise here. Also, in 13% of the cases, the model returns invalid JSON that cannot be parsed (in some cases, the model ran out of tokens to represent the task, or simply failed to follow the syntax rules of JSON). In general, it is important to track failure reasons as they can be used to improve prompts; a minimal sketch of such tracking is shown below.
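This is one way such tracking might look, assuming each raw response is either empty/blocked, valid JSON, or invalid JSON (the variable names are placeholders):

import json
from collections import Counter

failure_reasons = Counter()
for raw_text in raw_responses:  # raw model outputs collected earlier; empty/None when blocked
    if not raw_text:
        failure_reasons["blocked"] += 1
        continue
    try:
        json.loads(raw_text)
        failure_reasons["ok"] += 1
    except json.JSONDecodeError:
        failure_reasons["invalid_json"] += 1

print(failure_reasons)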
⚠️ Note on dataset: The dataset and analysis above are only for illustrative purposes. Ensure you include additional checks in any production system that aims to predict protected attributes such as gender.
⚠️ Note on evaluation: This post does not cover the design of a detailed evaluation harness for the described structured data extraction task. In practice, we would need to define a labelled set of ground-truth data which can be used to assess the performance of multiple models and prompt variations. The harness should help answer critical questions such as:
does the model find all occurrences of characters in a summary?
does the model assign gender, extract actions, etc., in a manner that is reliable and consistent? E.g., are there clear areas where errors occur?
is the rationale for suggested matching sensible?
are the extracted actions at an appropriate level of granularity and meaningful?
Finally, to explore techniques for reliability, see this article.
Developer Notes on the PaLM API
While trying out the models, I noticed a few important differences in how the PaLM API works compared to, say, the OpenAI API or OSS models available via the transformers library. These may be due to optimizations that make these models efficient to serve at scale, subtle differences in model architecture, or training data composition.
✅ Citations: license, safety attributes, author. This is a unique and highly positive aspect of the PaLM API. If the generated content is related to a known author, license, book title, etc., this gets included in the response. Excellent for building apps with attribution! As far as I know, this is the only API that explores doing this, and it must take quite a significant amount of engineering to make this happen. Kudos!
⚠️ Maximum number of responses. Unlike other APIs, where you can generate n variations of responses bounded by the max output token size, the PaLM API has a strict limit on this (some models have it set to 2, others 4). For most applications, this is fine. As an alternative, you can always make additional calls, or prompt the model to return a list of responses in a single call. In addition, when temperature=0, PaLM will return only a single value even when asked to generate n responses.
⚠️ Alternating Message Authors. The API strictly expects alternating authors for chat-based messages. In llmx, I implement a simple check for consecutive messages from the same author and merge them with a newline character (a minimal sketch of this idea is shown below).
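A minimal sketch of that merge step (not the exact llmx implementation):

def merge_consecutive_messages(messages):
    """Merge consecutive messages from the same author so that authors alternate."""
    merged = []
    for message in messages:
        if merged and merged[-1]["author"] == message["author"]:
            merged[-1]["content"] += "\n" + message["content"]
        else:
            merged.append(dict(message))
    return merged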
⚠️ Blocked Responses. In some cases, the PaLM API may block responses due to safety concerns. In such cases, the response contains a dedicated blocked field and a safetyAttributes dictionary that contains a list of categories (e.g., Derogatory, Profanity, etc.) and scores per category. This is useful to monitor for graceful degradation in apps (e.g., offering the user some recommendation on how to recover from the failure). About 9% of the responses in the structured data extraction from book summaries example above were blocked.
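Continuing from the REST sketch earlier, a minimal check for blocked responses might look like the following (field names follow the description above; the exact nesting may differ between the REST API and the client libraries):

prediction = response.json()["predictions"][0]
safety = prediction.get("safetyAttributes", {})
# For chat models, safetyAttributes may be a list with one entry per candidate
if isinstance(safety, list):
    safety = safety[0] if safety else {}
if safety.get("blocked"):
    # Degrade gracefully, e.g., show a fallback message and log categories/scores
    print("Blocked:", list(zip(safety.get("categories", []), safety.get("scores", []))))
else:
    print(prediction)  # the generated content lives here (content or candidates, depending on model type)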
⚠️ Prompt Sensitivity. In the example use case above (structured data extraction), the model is required to output JSON structured data in a specific format defined in the prompt. I found that the `codechat-bison` model performed significantly worse (it completely failed to follow the suggested output format) compared to the `chat-bison` model. This is likely because the task is not an explicit code generation task, even though the model is prompted to output JSON-structured text. I also found it necessary to include explicit commands such as “do not include double quotes in results” to get `chat-bison` to avoid that specific mistake (which invalidates JSON parsing). In contrast, a general chat model like GPT-3.5/4 can address both text and code tasks equally well, easily avoiding formatting mistakes without any special prompting.
Conclusion
With the right prompting, PaLM is a fairly capable model, with additional benefits such as citations and fine-grained access control via the Vertex AI GCP interface. I also found the API to be fast, with reasonable response times.
Deciding between OSS models and hosted models like GPT/PaLM? I wrote a quick guide here.
Have you tried the PaLM model? Share your thoughts and insights in the comments!
References
Vertex AI available models
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., ... & Wu, Y. (2023). PaLM 2 Technical Report. arXiv preprint arXiv:2305.10403.
*All views my own.