Greetings readers,
With Andre’s recent report on the Data-Driven VC Landscape (thanks again for the feature), it has become even more evident that adopting a data-driven approach is increasingly important. Yet despite the desire of many VC firms to become more data-driven, only a small percentage have dedicated teams and initiatives in place. There is also a huge lack of accessible resources in this area, leaving everyone either reinventing the wheel or not knowing where to start. At Lunar we are big on a tech-first approach to VC, so this blog post aims to bridge that gap by providing an introductory guide for individuals, solo GPs, and microfunds who want to embark on the journey of becoming more data-driven in their VC practice and still have a greenfield in front of them. We will focus on building a technical minimum viable product (MVP) to kickstart your data-driven transformation.
Why Data-Driven VC Matters
Data-driven VC involves leveraging technology, analytics, and data to gain insights and make informed investment decisions. By incorporating data into the investment process, VC firms can uncover patterns, identify trends, and enhance their decision-making capabilities. This not only helps in improving investment performance but also enables VCs to identify promising startups, optimize deal flow, and manage portfolio companies more effectively. Perhaps most importantly, engineering activities scale well and have been found to be strongly associated with higher assets under management (AUM) per engineer in larger funds.
Stop talking about being data-driven - start building!
While adopting a data-driven approach in VC requires a cultural shift and the establishment of new processes, this blog post will focus on actually building a technical MVP. I recognize the lack of accessible resources in this area and hence aim to provide a starting point for individuals and smaller VC entities, who have a massive advantage: a greenfield and modern tooling at their disposal (no more Zapier, no Apache Airflow - I promise). By building a technical MVP, you can lay the foundation for a scalable, data-driven VC practice.
Setting Up Your Minimum Viable Data Stack
To begin your data-driven VC journey, we will guide you through setting up a basic data stack using modern cloud-native tools. We will use Google BigQuery as our data warehouse, Airtable as our CRM system for deal flow, and Google Sheets as a data source. However, keep in mind that these tools are interchangeable with others such as Snowflake, HubSpot, or ClickUp. For ease, scalability, and composability, we will utilize modern tooling.
In the next hour, you will:
1. Stand up a Google project with Google BigQuery.
2. Integrate data from Airtable and Google Sheets into your data warehouse.
3. Write a SQL query to surface deals that are not yet in your deal flow CRM.
4. Establish basic tooling and processes for a repeatable analytical workflow.
Requirements:
To follow along, you will need the following:
- A computer with Python 3 installed (if not, download it from python.org/downloads).
- A Google account.
- A Google Cloud Platform account.
- An Airtable account.
- An Airbyte account (you can use the free trial, or alternatively, run it locally with Docker).
Let's dive into the steps to set up your minimum viable VC data stack.
Tutorial: Move from Spreadsheets to SQL
Step 1: Install the Google Cloud CLI
Let’s get our hands dirty. We are going to use Google BigQuery, so we need the gcloud CLI to interface with Google Cloud Platform. If you're using a Mac, you can install the gcloud CLI with Homebrew by running the following command in your terminal:
brew install --cask google-cloud-sdk
If you're on a different operating system, follow the installation instructions provided on the [Google Cloud SDK documentation](https://cloud.google.com/sdk/docs/install#installation_instructions).
Once installed, authenticate your gcloud CLI by running:
gcloud auth login --brief
Step 2: Create a New Project on GCP
To create a new project on Google Cloud Platform (GCP) using the gcloud CLI, follow these steps:
1. Create a new Google Cloud project called "mydatamvp":
gcloud projects create mydatamvp --name="My Data MVP"
2. Set the created project as the default project:
gcloud config set project mydatamvp
3. Activate the Google BigQuery API:
gcloud services enable bigquery.googleapis.com
4. Create a dataset called "mydataset" using the bq CLI, which is installed alongside gcloud (gcloud itself has no dataset command):
bq mk --dataset mydatamvp:mydataset
Step 3: Creating a Service Account
To allow Airbyte to sync your data to BigQuery, you need to create a service account with the appropriate permissions. Follow these commands:
gcloud iam service-accounts create my-service-account \
--display-name "My Service Account"
Now let’s add the permissions needed. Note that `gcloud projects add-iam-policy-binding` accepts only a single --role per invocation, so we grant each role in turn:
for role in roles/bigquery.dataEditor roles/bigquery.dataViewer roles/bigquery.jobUser roles/bigquery.metadataViewer roles/bigquery.user; do
  gcloud projects add-iam-policy-binding mydatamvp \
    --member "serviceAccount:my-service-account@mydatamvp.iam.gserviceaccount.com" \
    --role "$role"
done
Note: Replace `my-service-account` with the desired name for your service account.
Step 4: Download Your Service Account Credentials
Save your service account credentials as a JSON file. Replace `~/path/to/key.json` with the path where you want to store the credentials file.
gcloud iam service-accounts keys create ~/path/to/key.json \
--iam-account=my-service-account@mydatamvp.iam.gserviceaccount.com
We will need this file later for Airbyte configuration.
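Before moving on, it is worth sanity-checking the downloaded file. A small sketch that verifies it really is a service-account key and returns the account it belongs to (the path is whatever you chose above):

```python
import json
import os

def check_key(path: str) -> str:
    """Return the service-account email from a key file, failing
    loudly if the file is not a service-account key."""
    with open(os.path.expanduser(path)) as f:
        key = json.load(f)
    if key.get("type") != "service_account":
        raise ValueError(f"{path} is not a service-account key")
    return key["client_email"]

# e.g. print(check_key("~/path/to/key.json"))
```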
Step 5: Copy and Explore the Example Data
Make a copy of the following Google Sheet into your own Google Sheets account: My-Data-MVP-Example-Gsheet
Explore the sheet. As you can see, this is an export of the CNCF Landscape, which we use as an example data source for deal flow. Candidate selection is a whole series of posts in itself, so we are skipping it for now - the CNCF Landscape can be a good first start if you are investing in software infrastructure startups.
Next, use this deal flow CRM template from Airtable: Airtable Deal Flow CRM Template. In our tutorial we will use this to mimic our CRM. Explore the CRM and find the Company tab. As you can see, there are only a few companies in there so far.
Step 6: Setup Airbyte
To sync data from Airtable and Google Sheets to BigQuery, we will now set up Airbyte. Follow these steps:
1. Visit airbyte.com and create a trial account. Alternatively, you can run Airbyte locally with Docker using the instructions provided [here](https://docs.airbyte.com/deploying-airbyte/local-deployment/).
2. Create two sources and one destination in Airbyte.
Step 6.1: Setting Up the Sources
1. Log into your Airbyte account and click on the "+ Source" button in the top-right corner.
2. Select "Google Sheets" as the source and authorize Airbyte to access your Google Sheets by following the prompts.
3. Enter your Sheet ID, which can be found in the URL of your Google Sheet.
4. Provide the range of data you want to sync. If you want to sync all rows and columns, enter 'Sheet1' or the name of the sheet you wish to sync.
5. Click "Check Connection." If successful, click "Set up Source."
6. Repeat the same steps to add Airtable as a source. You'll need your Airtable API key, which can be found in your account settings. Enter the name of your base and table after entering the API key.
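The Sheet ID mentioned in step 3 can also be extracted programmatically, which comes in handy once you start automating the onboarding of new sheets. A minimal sketch, assuming the standard docs.google.com/spreadsheets/d/&lt;SHEET_ID&gt;/... URL format:

```python
import re

def sheet_id_from_url(url: str) -> str:
    """Extract the spreadsheet ID from a Google Sheets URL,
    e.g. https://docs.google.com/spreadsheets/d/<SHEET_ID>/edit#gid=0"""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", url)
    if not match:
        raise ValueError(f"No spreadsheet ID found in: {url}")
    return match.group(1)
```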
Step 6.2: Setting Up the Destination
1. Click "+ Destination" on your Airbyte dashboard.
2. Choose "Google BigQuery" as the destination. Paste the Service Account Key you saved in Step 4 into the Service Account JSON field.
3. Enter the Project ID, Dataset ID (created in Step 2), and Dataset Region (based on your geographical location).
4. Click "Check Connection." If successful, click "Set up Destination."
Step 7: Syncing the Data
Configure the sync frequency for your data sources in Airbyte.
1. Go to the "Connections" tab on the Airbyte dashboard and click "+ New Connection."
2. Choose your sources and destination and set the desired sync frequency.
3. Click "Set Up Connection."
Airbyte will now periodically sync the data from your Google Sheets and Airtable to your Google BigQuery data warehouse.
Step 8: Write your first SQL Query
With your data in place, you are ready to write SQL queries in Google BigQuery.
To achieve our goal of finding all CNCF projects with less than $5M funding that are not in our CRM, use the following SQL query as an example:
SELECT project_name, funding
FROM `mydatamvp.mydataset.CNCF`
WHERE funding < 5000000
AND project_name NOT IN (
SELECT project_name
FROM `mydatamvp.mydataset.CRM`
)
Replace `mydatamvp.mydataset.CNCF` with your Google BigQuery table name for the CNCF data and `mydatamvp.mydataset.CRM` with your Google BigQuery table name of the companies sheet in your Airtable CRM.
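If you later want to run variants of this query from Python, a minimal sketch that templates it for arbitrary table names and thresholds (the table names below are the placeholders from this tutorial):

```python
def build_gap_query(cncf_table: str, crm_table: str, max_funding: int = 5_000_000) -> str:
    """Build the 'CNCF projects not yet in the CRM' query for any
    pair of BigQuery tables and funding threshold."""
    return (
        f"SELECT project_name, funding\n"
        f"FROM `{cncf_table}`\n"
        f"WHERE funding < {max_funding}\n"
        f"AND project_name NOT IN (\n"
        f"  SELECT project_name\n"
        f"  FROM `{crm_table}`\n"
        f")"
    )

# To execute it from Python you could use the official BigQuery client
# (pip install google-cloud-bigquery), authenticated with the key from Step 4:
# from google.cloud import bigquery
# client = bigquery.Client.from_service_account_json("path/to/key.json")
# for row in client.query(build_gap_query("mydatamvp.mydataset.CNCF",
#                                         "mydatamvp.mydataset.CRM")).result():
#     print(row.project_name, row.funding)
```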
Hold on, but where do we run the query? You could run it in the BigQuery UI in the Google Cloud console. There are also many tools, such as Tableau and Looker, in which you can bind SQL queries to charts and tables. For now, we will execute the query using the bq CLI, which was conveniently installed alongside the gcloud CLI.
Run the following in your terminal to obtain a list of all CNCF projects with less than $5M funding that are not in your CRM:
bq query --nouse_legacy_sql '
SELECT project_name, funding
FROM `mydatamvp.mydataset.CNCF`
WHERE funding < 5000000
AND project_name NOT IN (
  SELECT project_name
  FROM `mydatamvp.mydataset.CRM`
)'
This is just a trivial example demonstrating an end-to-end integration. Naturally, company names are weak unique identifiers. Now, how about trying a web URL instead on your own? Hint: first go to Airtable and add a URL column to the company sheet.
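Until you have such a stronger identifier, you can at least make the name-based join a little more robust by normalizing names before comparing them. A crude sketch (the suffix list is illustrative, not exhaustive):

```python
def normalize_name(name: str) -> str:
    """Lowercase, trim, and strip common legal suffixes so that
    'Akri Inc.' and 'akri' compare equal."""
    suffixes = (" inc.", " inc", " ltd.", " ltd", " llc", " gmbh")
    n = name.strip().lower()
    for suffix in suffixes:
        if n.endswith(suffix):
            n = n[: -len(suffix)].strip()
            break
    return n
```

Applying the same normalization on both sides of the NOT IN comparison (for example in a BigQuery view) catches many near-miss duplicates.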
Step 9: Reproduce
Try adding Akri to the Airtable CRM. Make sure to spell the company name exactly the same (Akri). Head over to Airbyte and trigger a new sync. If you rerun the above query now, you will no longer see Akri, as it is already in your CRM. 🥇
Conclusion:
Congratulations! You have successfully set up your minimum viable VC data stack, transitioning from spreadsheets to SQL. This is just the beginning of your data-driven VC journey. Remember, becoming a data-driven VC is a continuous journey rather than a destination. Embrace the process, enjoy the benefits of data-powered decision-making, and please let me know what you would like to see in future posts.