Interact with Google's Gemini API with Python for free!
Use Google's powerful vLLMs on any hardware, entirely for free.
Large language models (LLMs) have taken the world by storm, revolutionizing how we interact with machines and process information. Their ability to understand and generate human language with unprecedented sophistication has sparked a new wave of excitement about artificial intelligence (AI) and a massive mainstream increase in interest in a once-niche research field. Some users have even started to prefer LLMs over established search engines for finding information. However, the current generation of LLMs is not limited to language processing. Most modern models are multi-modal, meaning they can process (and in some cases also output) other modalities such as vision and audio. In this article, we will look at a family of LLMs with vision input capabilities (vLLMs): Google Gemini.
Since the long-awaited Gemini release was postponed multiple times and its performance did not match public expectations, I found that many developers were not aware that Google offers free API access (within usage limits) to its models. While Google's models may not match (the paid) GPT-4 models, they are nonetheless strong models suitable for many tasks, and since they are hosted by Google, you can prompt them from any hardware you want. In this article, I want to show you how to set up your API key and send your first prompts from Python!
Preliminary: How free is free?
Running vLLMs is expensive, and understandably Google rate-limits access. For the most recent limits I refer you to https://ai.google.dev/pricing, as the limits depend on your location and the model used, and may change from time to time. As an overview, at the time of writing most models were limited to 15 RPM (requests per minute), 1 million TPM (tokens per minute), and 1,500 RPD (requests per day) for me. If you exceed the limits, Google simply blocks your prompt, but you will not be charged. There is, however, one little caveat: you “pay” with your data. If you use the free tier, you agree that your prompts and the generated responses may be used to improve Google's products. Currently, the free tier is not available if your request is sent from the EEA (including the EU), the UK, or Switzerland. If you send a request from these countries, you will receive error 400, unless you use a VPN.
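Rather than letting the API reject requests once you hit 15 RPM, you can throttle on the client side. Here is a minimal sketch of a sliding-window limiter; the `RateLimiter` class and its defaults are my own illustration, not part of any Google SDK:

```python
import time
from collections import deque

class RateLimiter:
    """Blocks before a request if the free-tier window (e.g., 15 RPM) is full."""

    def __init__(self, max_requests=15, per_seconds=60.0):
        self.max_requests = max_requests
        self.per_seconds = per_seconds
        self.sent = deque()  # timestamps of recent requests

    def wait(self):
        now = time.monotonic()
        # Forget requests that fell out of the sliding window.
        while self.sent and now - self.sent[0] >= self.per_seconds:
            self.sent.popleft()
        if len(self.sent) >= self.max_requests:
            # Sleep until the oldest request leaves the window.
            time.sleep(self.per_seconds - (now - self.sent[0]))
        self.sent.append(time.monotonic())
```

Call `limiter.wait()` right before each API request; the call returns immediately while you are under the limit and sleeps just long enough otherwise.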
How do I use the Gemini API?
Unfortunately, Google made access to Gemini a bit confusing, so let me give you a quick overview of the ways I know of:
- https://gemini.google.com/app allows you to interact with the web UI but not API!
- Vertex AI (https://cloud.google.com/vertex-ai) is the commercial API — this is NOT what you are looking for!
- Google AI Studio (https://aistudio.google.com/) is what you want! It also allows you to interact with the models without API access if you want to play around!
Step 1: Generate an API Key
Go to https://aistudio.google.com/app/apikey and generate an API key. If you do not have a Google Cloud account or project, you must first create one. It may look scary, but you have to go through this. Note that you will not be billed unless you click “Set up billing”, even if billing is enabled in Google Cloud. Never share your API key, and make sure it is excluded from any public repository on GitHub. A good way to avoid accidentally leaking your API key is to save and access it from an environment variable. Under Linux/macOS you can run export GEMINI_API_KEY=… in your terminal. You do not need to remember your API key; you can always look it up in AI Studio.
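A forgotten export is a common stumbling block, so it can help to fail with a clear message instead of a bare KeyError. This is a small sketch; the helper name load_gemini_key is my own, only the GEMINI_API_KEY variable name comes from the setup above:

```python
import os

def load_gemini_key():
    """Read the API key from the environment instead of hard-coding it."""
    key = os.environ.get("GEMINI_API_KEY")
    if key is None:
        raise RuntimeError(
            "GEMINI_API_KEY is not set. Run `export GEMINI_API_KEY=...` first."
        )
    return key
```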
Step 2: Pick a model
At the time of writing, the Gemini API offered the following (vLLM) model endpoints:
- gemini-1.5-pro (2M token context window)
- gemini-1.5-flash (1M token context window)
- gemini-1.0-pro-vision-latest (12k token context window; note that this model only accepts multi-modal prompts — text-only will be rejected)
I do not think there is any reason to use Gemini 1.0 anymore, so your choice is likely between 1.5 Pro and 1.5 Flash. Pro gives you double the context window and is the strongest model overall, but it is slower and has stricter rate limits. Flash should be the best option for most users.
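If you want to encode that decision in code, a tiny helper could pick the endpoint based on how large your prompt is. This is purely illustrative (the function name and the prefer_speed flag are my own; the context-window figures are the ones listed above):

```python
def pick_model(estimated_tokens, prefer_speed=True):
    """Pick a Gemini 1.5 endpoint based on prompt size.

    Context windows assumed: Flash 1M tokens, Pro 2M tokens.
    """
    if estimated_tokens > 2_000_000:
        raise ValueError("Prompt exceeds even gemini-1.5-pro's 2M-token window.")
    if estimated_tokens > 1_000_000:
        return "gemini-1.5-pro"  # only Pro fits more than 1M tokens
    return "gemini-1.5-flash" if prefer_speed else "gemini-1.5-pro"
```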
Step 3: Install the library
The Gemini API has a REST interface, which you could access via HTTP requests from curl etc., but I think most people will be interested in Python, so let's skip to that. While you could also send raw HTTP requests from Python, Google offers a simpler way via their own Python module: the Google AI Python SDK for the Gemini API. You can install it via pip: pip install google-generativeai.
SDKs are also available for Swift, Kotlin, and JavaScript.
Step 4: Make your API call!
The simplest interaction with Gemini is to send a single prompt that the model answers and then automatically ends the conversation. Here is what that could look like:
import google.generativeai as genai
import os
genai.configure(api_key=os.environ["GEMINI_API_KEY"]) # reads your key from the environment
model = genai.GenerativeModel(model_name="gemini-1.5-flash") # replace with your model
prompt_parts = [
genai.upload_file("test.jpeg"), # pass the path to your image
"Describe the image.", # text prompt (can be before, after, or interleaved)
]
response = model.generate_content(prompt_parts) # the actual call
print(response.text)
Note that the text prompt can be any text. Gemini supports many tasks and describing an image is arguably one of the simpler ones. Images are passed as file paths and uploaded to Google servers for model inference.
So how do you deal with cases where you have follow-up prompts (e.g., you want to ask the model about details in the image)? In that case, you have to retain a local history of all prompts and responses and send it along with every new generation request. You could maintain such a list in the required format yourself, but Google has a convenient helper ready for you. You can create a “chat session” that automatically maintains your history and prepends it to every new prompt:
chat_session = model.start_chat()
response = chat_session.send_message(prompt_parts)
print(response.text) # first answer
response = chat_session.send_message("<Your follow up prompt>")
print(response.text) # second answer
print(chat_session.history) # print the entire history (this is passed to generate_content)
...
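To make concrete what the chat session maintains for you, here is a sketch of the manual version: a plain list of content dicts in the SDK's {"role": ..., "parts": [...]} format that is resent in full on every call. The ask helper is my own name, not an SDK function:

```python
def ask(model, prompt, history):
    """Send `prompt` plus the accumulated history, and record both turns.

    `history` is a list of content dicts in the SDK's format:
    {"role": "user" | "model", "parts": [...]}.
    """
    history.append({"role": "user", "parts": [prompt]})
    response = model.generate_content(history)  # the full history, every call
    history.append({"role": "model", "parts": [response.text]})
    return response.text
```

This is essentially what chat_session.send_message() does for you behind the scenes, which is why long conversations consume tokens quickly: every turn resends everything before it.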
From time to time, you may find that Gemini refuses to respond. This is often due to a triggered safety system. For instance, I found that any prompt containing “knife”, in whatever context, was rejected by Gemini 1.0 Pro Vision. Unfortunately, some of the error messages are hard to understand or contain no information, so you may not even know that a safety system caused the issue. You can, however, configure Gemini's safety levels (also see https://ai.google.dev/gemini-api/docs/safety-settings). To turn them off entirely, pass the following config to the generate_content() or send_message() calls. This should fix most “weird” errors.
safety_settings = [
{"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
{"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
{"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
{"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
]
response = model.generate_content(prompt_parts, safety_settings=safety_settings)
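If you would rather find out why a response came back empty before disabling the filters, you can inspect the response object. This is a best-effort sketch: the candidates and prompt_feedback.block_reason attributes follow the google-generativeai response object, and explain_refusal is my own helper name:

```python
def explain_refusal(response):
    """Return a short diagnostic string for a Gemini response."""
    if getattr(response, "candidates", None):
        return "ok: model returned candidates"
    feedback = getattr(response, "prompt_feedback", None)
    reason = getattr(feedback, "block_reason", None)
    if reason:
        return f"prompt blocked: {reason}"
    return "no candidates and no block reason reported"
```

Printing this instead of blindly accessing response.text saves you from the opaque exceptions the SDK raises on blocked prompts.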
If you want more control, you can tweak the generation config. For the sake of this article, I will not explain the parameters in detail, as they are common LLM generation parameters, but refer you to this article instead. Pass these parameters as another config, like the safety_settings, to apply them:
generation_config = {
"temperature": 1, # "randomness" of the output; 0 = deterministic
"top_p": 0.95, # nucleus sampling: only consider the smallest token set whose cumulative probability exceeds top_p
"top_k": 64, # only sample from the top_k most likely tokens
"max_output_tokens": 8192 # maximum number of tokens to generate
}
response = model.generate_content(prompt_parts,
safety_settings=safety_settings,
generation_config=generation_config)
Lastly, I want to add that you can build a prompt directly in Google AI Studio and then export the code to any supported language. This can be a good starting point, but I often found that the generated code did not work out of the box. If you have any questions, feel free to ask here and I will try to answer them!
Thank you for reading this article! If you enjoyed it please consider subscribing to my updates. If you have any questions or suggestions feel free to leave them in the comments.