GitHub Discovery

GitHub Provider

The GitHub integration has a discovery provider for discovering catalog entities within a GitHub organization. The provider will crawl the GitHub organization and register entities matching the configured path. This can be useful as an alternative to static locations or manually adding things to the catalog. This is the preferred method for ingesting entities into the catalog.

Installation without Events Support

You will have to add the provider in the catalog initialization code of your backend. The provider is not installed by default, so you have to add a dependency on @backstage/plugin-catalog-backend-module-github to your backend package.

# From your Backstage root directory
yarn --cwd packages/backend add @backstage/plugin-catalog-backend-module-github

And then update your backend by adding the following line:

packages/backend/src/index.ts
// github discovery
backend.add(import('@backstage/plugin-catalog-backend-module-github/alpha'));

Installation with Events Support

For the legacy backend system, please read the sub-section below.

The catalog module for GitHub comes with events support enabled. This makes it subscribe to its relevant topics (github.push) and expect these events to be published via the EventsService.

Additionally, you should install the event router provided by events-backend-module-github, which routes received events from the generic topic github to more specific ones based on the event type (e.g., github.push).
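The routing described above can be pictured with a minimal sketch. The `IncomingEvent` shape and the `toSpecificTopic` helper are hypothetical names used only for illustration and are not the router's actual API. The sketch assumes the event type is taken from the X-GitHub-Event header that GitHub sends with each webhook delivery.

```typescript
// Hypothetical, simplified event shape; the real EventsService types differ.
interface IncomingEvent {
  topic: string; // generic topic, e.g. 'github'
  metadata: Record<string, string>; // request headers, lower-cased (assumption)
}

// Derive the more specific topic from the generic one and the event type.
// Returns undefined when the event type header is missing.
function toSpecificTopic(event: IncomingEvent): string | undefined {
  const eventType = event.metadata['x-github-event'];
  return eventType ? `${event.topic}.${eventType}` : undefined;
}

const routedTopic = toSpecificTopic({
  topic: 'github',
  metadata: { 'x-github-event': 'push' },
});
console.log(routedTopic); // github.push
```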

In order to receive webhook events from GitHub, you have to decide how you want them to be ingested into Backstage and published to its EventsService. You can decide between the following options (extensible):

Legacy Backend System

Please follow the installation instructions at

Additionally, you need to decide how you want to receive events from external sources like

Set up your provider

packages/backend/src/plugins/catalog.ts
import { CatalogBuilder } from '@backstage/plugin-catalog-backend';
import { GithubEntityProvider } from '@backstage/plugin-catalog-backend-module-github';
import { ScaffolderEntitiesProcessor } from '@backstage/plugin-scaffolder-backend';
import { Router } from 'express';
import { PluginEnvironment } from '../types';

export default async function createPlugin(
  env: PluginEnvironment,
): Promise<Router> {
  const builder = await CatalogBuilder.create(env);
  builder.addProcessor(new ScaffolderEntitiesProcessor());
  const githubProvider = GithubEntityProvider.fromConfig(env.config, {
    events: env.events,
    logger: env.logger,
    scheduler: env.scheduler,
  });
  builder.addEntityProvider(githubProvider);
  const { processingEngine, router } = await builder.build();
  await processingEngine.start();
  return router;
}

You can check the official docs to configure your webhook and to secure your request. The webhook will need to be configured to forward push events.
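To give a sense of what securing the request involves: GitHub signs each webhook payload with the configured secret and sends the result in the X-Hub-Signature-256 header as `sha256=<hex HMAC>`. The following is a minimal sketch of that check using Node's crypto module. The function name `isValidSignature` is a hypothetical helper, not part of Backstage, and it assumes you have access to the raw, unparsed request body.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical helper: checks a GitHub webhook signature header against
// the HMAC-SHA256 of the raw request body, keyed with the webhook secret.
function isValidSignature(
  secret: string,
  rawBody: string,
  signatureHeader: string,
): boolean {
  const expected =
    'sha256=' + createHmac('sha256', secret).update(rawBody).digest('hex');
  const expectedBuf = Buffer.from(expected);
  const receivedBuf = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return (
    expectedBuf.length === receivedBuf.length &&
    timingSafeEqual(expectedBuf, receivedBuf)
  );
}
```

A request whose header does not match should be rejected before the event is published.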

Configuration

To use the discovery provider, you'll need a GitHub integration set up with either a Personal Access Token or GitHub Apps. For Personal Access Tokens you should pay attention to the required scopes, where you will need at least the repo scope for reading components. For GitHub Apps you will need to grant it the required permissions instead, where you will need at least the Contents: Read-only permissions for reading components.

Then you can add a github config to the catalog providers configuration:

catalog:
  providers:
    github:
      # the provider ID can be any camelCase string
      providerId:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          branch: 'main' # string
          repository: '.*' # Regex
        schedule: # same options as in TaskScheduleDefinition
          # supports cron, ISO duration, "human duration" as used in code
          frequency: { minutes: 30 }
          # supports ISO duration, "human duration" as used in code
          timeout: { minutes: 3 }
      customProviderId:
        organization: 'new-org' # string
        catalogPath: '/custom/path/catalog-info.yaml' # string
        filters: # optional filters
          branch: 'develop' # optional string
          repository: '.*' # optional Regex
      wildcardProviderId:
        organization: 'new-org' # string
        catalogPath: '/groups/**/*.yaml' # this will search all folders for files that end in .yaml
        filters: # optional filters
          branch: 'develop' # optional string
          repository: '.*' # optional Regex
      topicProviderId:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          branch: 'main' # string
          repository: '.*' # Regex
          topic: 'backstage-exclude' # optional string
      topicFilterProviderId:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          branch: 'main' # string
          repository: '.*' # Regex
          topic:
            include: ['backstage-include'] # optional array of strings
            exclude: ['experiments'] # optional array of strings
      validateLocationsExist:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          branch: 'main' # string
          repository: '.*' # Regex
        validateLocationsExist: true # optional boolean
      visibilityProviderId:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          visibility:
            - public
            - internal
      enterpriseProviderId:
        host: ghe.example.net
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string

This provider supports multiple organizations via unique provider IDs.

Note: It is possible but certainly not recommended to skip the provider ID level. If you do so, default will be used as provider ID.

  • catalogPath (optional): Default: /catalog-info.yaml. Path where to look for catalog-info.yaml files. You can use wildcards - * or ** - to search the path and/or the filename. Wildcards cannot be used if the validateLocationsExist option is set to true.
  • filters (optional):
    • branch (optional): String used to filter results based on the branch name.
    • repository (optional): Regular expression used to filter results based on the repository name.
    • topic (optional): Both of the filters below may be used at the same time but the exclusion filter has the highest priority. In the example above, a repository with the backstage-include topic would still be excluded if it were also carrying the experiments topic.
      • include (optional): An array of strings used to filter in results based on their associated GitHub topics. If configured, only repositories with one (or more) topic(s) present in the inclusion filter will be ingested.
      • exclude (optional): An array of strings used to filter out results based on their associated GitHub topics. If configured, all repositories except those with one (or more) topic(s) present in the exclusion filter will be ingested.
    • visibility (optional): An array of strings used to filter results based on their visibility. Available options are private, internal, public. If configured (non-empty), only repositories with visibility present in the filter will be ingested.
  • host (optional): The hostname of your GitHub Enterprise instance. It must match a host defined in integrations.github.
  • organization: Name of your organization account/workspace. If you want to add multiple organizations, you need to add one provider config each.
  • validateLocationsExist (optional): Whether to validate that locations exist before emitting them. This option avoids generating locations for catalog info files that do not exist in the source repository. Defaults to false. Due to limitations in the GitHub API's ability to query for repository objects, this option cannot be used in conjunction with wildcards in the catalogPath.
  • schedule:
    • frequency: How often you want the task to run. The system does its best to avoid overlapping invocations.
    • timeout: The maximum amount of time that a single task invocation can take.
    • initialDelay (optional): The amount of time that should pass before the first invocation happens.
    • scope (optional): 'global' or 'local'. Sets the scope of concurrency control.

GitHub API Rate Limits

GitHub rate limits API requests to 5,000 per hour (or more for Enterprise accounts). The snippet below refreshes the Backstage catalog data every 35 minutes, which issues an API request for each discovered location.

If your requests are too frequent then you may get throttled by rate limiting. You can change the refresh frequency of the catalog in your app-config.yaml file by controlling the schedule.

schedule:
  frequency: { minutes: 35 }
  timeout: { minutes: 3 }

More information about scheduling can be found on the TaskScheduleDefinition page.

Alternatively, or additionally, you can configure github-apps authentication which carries a much higher rate limit at GitHub.

Rate limiting applies to any method of adding GitHub entities to the catalog, but the limit is especially easy to hit with automatic discovery.
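As a rough way to reason about the schedule, the sketch below estimates the hourly request volume. It assumes, as stated above, one API request per discovered location per refresh and ignores any other GitHub traffic from the same credentials. `requestsPerHour` is a hypothetical helper name, and the figure of 1,000 locations is only an example.

```typescript
// Estimated GitHub API requests per hour for a given number of discovered
// locations and a refresh frequency in minutes (one request per location
// per refresh is an assumption taken from the text above).
function requestsPerHour(locations: number, frequencyMinutes: number): number {
  const refreshesPerHour = 60 / frequencyMinutes;
  return locations * refreshesPerHour;
}

// Example: 1,000 locations refreshed every 35 minutes.
console.log(Math.round(requestsPerHour(1000, 35))); // 1714
```

Under these assumptions, 1,000 locations on a 35-minute schedule stay well under the 5,000 requests per hour limit.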

GitHub Processor (To Be Deprecated)

The GitHub integration has a special discovery processor for discovering catalog entities within a GitHub organization. The processor will crawl the GitHub organization and register entities matching the configured path. This can be useful as an alternative to static locations or manually adding things to the catalog.

Installation

You will have to add the processors in the catalog initialization code of your backend. They are not installed by default, therefore you have to add a dependency on @backstage/plugin-catalog-backend-module-github to your backend package, plus @backstage/integration for the basic credentials management:

# From your Backstage root directory
yarn --cwd packages/backend add @backstage/integration @backstage/plugin-catalog-backend-module-github

And then add the processors to your catalog builder:

packages/backend/src/plugins/catalog.ts
import { CatalogBuilder } from '@backstage/plugin-catalog-backend';
import {
  GithubDiscoveryProcessor,
  GithubOrgReaderProcessor,
} from '@backstage/plugin-catalog-backend-module-github';
import {
  ScmIntegrations,
  DefaultGithubCredentialsProvider,
} from '@backstage/integration';
import { Router } from 'express';
import { PluginEnvironment } from '../types';

export default async function createPlugin(
  env: PluginEnvironment,
): Promise<Router> {
  const builder = await CatalogBuilder.create(env);
  // Resolve GitHub credentials from the configured integrations
  const integrations = ScmIntegrations.fromConfig(env.config);
  const githubCredentialsProvider =
    DefaultGithubCredentialsProvider.fromIntegrations(integrations);
  builder.addProcessor(
    GithubDiscoveryProcessor.fromConfig(env.config, {
      logger: env.logger,
      githubCredentialsProvider,
    }),
    GithubOrgReaderProcessor.fromConfig(env.config, {
      logger: env.logger,
      githubCredentialsProvider,
    }),
  );

  // ..
}

Configuration

To use the discovery processor, you'll need a GitHub integration set up with either a Personal Access Token or GitHub Apps.

Then you can add a location target to the catalog configuration:

catalog:
  locations:
    # (since 0.13.5) Scan all repositories for a catalog-info.yaml in the root of the default branch
    - type: github-discovery
      target: https://github.com/myorg
    # Or use a custom pattern for a subset of all repositories with default branch
    - type: github-discovery
      target: https://github.com/myorg/service-*/blob/-/catalog-info.yaml
    # Or use a custom file format and location
    - type: github-discovery
      target: https://github.com/*/blob/-/docs/your-own-format.yaml
    # Or use a specific branch-name
    - type: github-discovery
      target: https://github.com/*/blob/backstage-docs/catalog-info.yaml

Note the github-discovery type, as this is not a regular url processor.

When using a custom pattern, the target is composed of three parts:

  • The base organization URL, https://github.com/myorg in this case
  • The repository glob to scan, which accepts * wildcard tokens. This can simply be * to scan all repositories in the organization. The second example above only looks for repositories prefixed with service-.
  • The path within each repository to find the catalog YAML file. This will usually be /blob/main/catalog-info.yaml, /blob/master/catalog-info.yaml or a similar variation for catalog files stored in the root directory of each repository. You could also use a dash (-) for referring to the default branch.
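The three-part structure of the target can be illustrated with a small parser. This is a sketch for understanding the format only and is not the processor's own parsing code. `DiscoveryTarget`, `parseTarget`, and `repoMatches` are hypothetical names, and the sketch only handles targets of the form used in the first two examples (organization only, or organization, repository pattern, and blob path).

```typescript
// Hypothetical result type for the three parts of a discovery target.
interface DiscoveryTarget {
  org: string;
  repoPattern: string; // may contain * wildcards
  branch: string; // '-' refers to the default branch
  catalogPath: string;
}

function parseTarget(target: string): DiscoveryTarget {
  const match = /^https?:\/\/[^/]+\/(.*)$/.exec(target);
  if (!match) {
    throw new Error(`Not a valid target URL: ${target}`);
  }
  const segments = (match[1] ?? '').split('/').filter(s => s.length > 0);
  const [org, repoPattern, blob, branch, ...rest] = segments;
  if (!org) {
    throw new Error(`Missing organization in target: ${target}`);
  }
  if (repoPattern === undefined) {
    // Organization only: all repositories, default branch, root catalog file.
    return { org, repoPattern: '*', branch: '-', catalogPath: '/catalog-info.yaml' };
  }
  if (blob !== 'blob' || !branch || rest.length === 0) {
    throw new Error(`Expected <org>/<repo>/blob/<branch>/<path> in: ${target}`);
  }
  return { org, repoPattern, branch, catalogPath: '/' + rest.join('/') };
}

// Match a repository name against a pattern where * matches any characters.
function repoMatches(pattern: string, repoName: string): boolean {
  const escaped = pattern
    .split('*')
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp(`^${escaped}$`).test(repoName);
}

const parsed = parseTarget('https://github.com/myorg/service-*/blob/-/catalog-info.yaml');
console.log(parsed.repoPattern, repoMatches(parsed.repoPattern, 'service-billing'));
```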

GitHub API Rate Limits

GitHub rate limits API requests to 5,000 per hour (or more for Enterprise accounts). The default Backstage catalog backend refreshes data every 100 seconds, which issues an API request for each discovered location.

This means if you have more than ~140 catalog entities, you may get throttled by rate limiting. You can change the refresh rate of the catalog in your packages/backend/src/plugins/catalog.ts file:

const builder = await CatalogBuilder.create(env);

// For example, to refresh every 5 minutes (300 seconds).
builder.setProcessingIntervalSeconds(300);
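The figure of roughly 140 entities follows from the numbers above: a 100-second interval means 36 refreshes per hour, and 5,000 / 36 is about 139. The sketch below reproduces that arithmetic, assuming one request per location per refresh. `maxLocationsWithinLimit` is a hypothetical helper name.

```typescript
// Largest number of locations that stays within an hourly request limit,
// assuming one API request per location per refresh.
function maxLocationsWithinLimit(
  limitPerHour: number,
  intervalSeconds: number,
): number {
  const refreshesPerHour = 3600 / intervalSeconds;
  return Math.floor(limitPerHour / refreshesPerHour);
}

console.log(maxLocationsWithinLimit(5000, 100)); // 138
console.log(maxLocationsWithinLimit(5000, 300)); // 416
```

Raising the interval to 300 seconds therefore roughly triples the number of locations that fit within the same limit.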

Alternatively, or additionally, you can configure github-apps authentication which carries a much higher rate limit at GitHub.

Rate limiting applies to any method of adding GitHub entities to the catalog, but the limit is especially easy to hit with automatic discovery.