GitHub Discovery
This documentation is written for the old backend system, which has been replaced by the new backend system, the default since Backstage version 1.24. If you have migrated to the new backend system, you may want to read its own article instead. Otherwise, consider migrating!
GitHub Provider
The GitHub integration has a discovery provider for discovering catalog entities within a GitHub organization. The provider will crawl the GitHub organization and register entities matching the configured path. This can be useful as an alternative to static locations or manually adding things to the catalog. This is the preferred method for ingesting entities into the catalog.
Installation without Events Support
You will have to add the provider in the catalog initialization code of your backend. It is not installed by default, so you have to add a dependency on @backstage/plugin-catalog-backend-module-github to your backend package.
yarn --cwd packages/backend add @backstage/plugin-catalog-backend-module-github
And then add the entity provider to your catalog builder:
import { GithubEntityProvider } from '@backstage/plugin-catalog-backend-module-github';
export default async function createPlugin(
  env: PluginEnvironment,
): Promise<Router> {
  const builder = await CatalogBuilder.create(env);
  builder.addEntityProvider(
    GithubEntityProvider.fromConfig(env.config, {
      logger: env.logger,
      scheduler: env.scheduler,
    }),
  );
  // ..
}
Installation with Events Support
For the legacy backend system, please read the sub-section below.
The catalog module for GitHub comes with events support enabled. This makes it subscribe to its relevant topics (github.push), and it expects these events to be published via the EventsService.
Additionally, you should install the event router provided by events-backend-module-github, which routes received events from the generic topic github to more specific ones based on the event type (e.g., github.push).
To receive webhook events from GitHub, you have to decide how you want them to be ingested into Backstage and published to its EventsService.
You can decide between the following options (extensible):
Legacy Backend System
Please follow the installation instructions at
- https://github.com/backstage/backstage/tree/master/plugins/events-backend/README.md
- https://github.com/backstage/backstage/tree/master/plugins/events-backend-module-github/README.md
Additionally, you need to decide how you want to receive events from external sources like
Set up your provider
import { CatalogBuilder } from '@backstage/plugin-catalog-backend';
import { GithubEntityProvider } from '@backstage/plugin-catalog-backend-module-github';
import { ScaffolderEntitiesProcessor } from '@backstage/plugin-scaffolder-backend';
import { Router } from 'express';
import { PluginEnvironment } from '../types';
export default async function createPlugin(
  env: PluginEnvironment,
): Promise<Router> {
  const builder = await CatalogBuilder.create(env);
  builder.addProcessor(new ScaffolderEntitiesProcessor());
  // Passing the events service lets the provider react to github.push events
  const githubProvider = GithubEntityProvider.fromConfig(env.config, {
    events: env.events,
    logger: env.logger,
    scheduler: env.scheduler,
  });
  builder.addEntityProvider(githubProvider);
  const { processingEngine, router } = await builder.build();
  await processingEngine.start();
  return router;
}
You can check the official docs to configure your webhook and to secure your request. The webhook will need to be configured to forward push events.
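As a rough illustration of what securing the request involves: when a webhook secret is configured, GitHub signs each delivery with an HMAC-SHA256 of the raw request body and sends it in the X-Hub-Signature-256 header as sha256=<hex digest>. The sketch below checks that header with Node's crypto module. The function name isValidGithubSignature is hypothetical and not part of Backstage; how you obtain the raw body and the secret depends on your setup.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

// Hypothetical helper (not a Backstage API): verifies the value of the
// X-Hub-Signature-256 header against the raw, unparsed request body.
function isValidGithubSignature(
  rawBody: string,
  signatureHeader: string | undefined,
  secret: string,
): boolean {
  if (!signatureHeader || !signatureHeader.startsWith('sha256=')) {
    return false;
  }
  const expected =
    'sha256=' +
    createHmac('sha256', secret).update(rawBody, 'utf8').digest('hex');
  const received = Buffer.from(signatureHeader, 'utf8');
  const computed = Buffer.from(expected, 'utf8');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return (
    received.length === computed.length && timingSafeEqual(received, computed)
  );
}
```

The comparison uses timingSafeEqual rather than === so that the check does not leak how many leading characters matched.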
Configuration
To use the discovery provider, you'll need a GitHub integration set up with either a Personal Access Token or GitHub Apps. For Personal Access Tokens, pay attention to the required scopes: you need at least the repo scope for reading components. For GitHub Apps, grant the required permissions instead: you need at least the Contents: Read-only permission for reading components.
Then you can add a github config to the catalog providers configuration:
catalog:
  providers:
    github:
      # the provider ID can be any camelCase string
      providerId:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          branch: 'main' # string
          repository: '.*' # Regex
        schedule: # same options as in SchedulerServiceTaskScheduleDefinition
          # supports cron, ISO duration, "human duration" as used in code
          frequency: { minutes: 30 }
          # supports ISO duration, "human duration" as used in code
          timeout: { minutes: 3 }
      customProviderId:
        organization: 'new-org' # string
        catalogPath: '/custom/path/catalog-info.yaml' # string
        filters: # optional filters
          branch: 'develop' # optional string
          repository: '.*' # optional Regex
      wildcardProviderId:
        organization: 'new-org' # string
        catalogPath: '/groups/**/*.yaml' # this will search all folders for files that end in .yaml
        filters: # optional filters
          branch: 'develop' # optional string
          repository: '.*' # optional Regex
      topicProviderId:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          branch: 'main' # string
          repository: '.*' # Regex
          topic: 'backstage-exclude' # optional string
      topicFilterProviderId:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          branch: 'main' # string
          repository: '.*' # Regex
          topic:
            include: ['backstage-include'] # optional array of strings
            exclude: ['experiments'] # optional array of strings
      validateLocationsExist:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          branch: 'main' # string
          repository: '.*' # Regex
        validateLocationsExist: true # optional boolean
      visibilityProviderId:
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
        filters:
          visibility:
            - public
            - internal
      enterpriseProviderId:
        host: ghe.example.net
        organization: 'backstage' # string
        catalogPath: '/catalog-info.yaml' # string
This provider supports multiple organizations via unique provider IDs.
Note: It is possible but certainly not recommended to skip the provider ID level. If you do so, default will be used as provider ID.
- catalogPath (optional): Default: /catalog-info.yaml. Path where to look for catalog-info.yaml files. You can use wildcards (* or **) to search the path and/or the filename. Wildcards cannot be used if the validateLocationsExist option is set to true.
- filters (optional):
  - branch (optional): String used to filter results based on the branch name. Defaults to the default branch of the repository.
  - repository (optional): Regular expression used to filter results based on the repository name.
  - topic (optional): Both of the filters below may be used at the same time, but the exclusion filter has the highest priority. In the example above, a repository with the backstage-include topic would still be excluded if it were also carrying the experiments topic.
    - include (optional): An array of strings used to filter in results based on their associated GitHub topics. If configured, only repositories with one (or more) topic(s) present in the inclusion filter will be ingested.
    - exclude (optional): An array of strings used to filter out results based on their associated GitHub topics. If configured, all repositories except those with one (or more) topic(s) present in the exclusion filter will be ingested.
  - visibility (optional): An array of strings used to filter results based on their visibility. Available options are private, internal, public. If configured (non-empty), only repositories with a visibility present in the filter will be ingested.
- host (optional): The hostname of your GitHub Enterprise instance. It must match a host defined in integrations.github.
- organization: Name of your organization account/workspace. If you want to add multiple organizations, you need to add one provider config each.
- validateLocationsExist (optional): Whether to validate that locations exist before emitting them. This option avoids generating locations for catalog info files that do not exist in the source repository. Defaults to false. Due to limitations in the GitHub API's ability to query for repository objects, this option cannot be used in conjunction with wildcards in the catalogPath.
- schedule:
  - frequency: How often you want the task to run. The system does its best to avoid overlapping invocations.
  - timeout: The maximum amount of time that a single task invocation can take.
  - initialDelay (optional): The amount of time that should pass before the first invocation happens.
  - scope (optional): 'global' or 'local'. Sets the scope of concurrency control.
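The interaction between the topic include and exclude filters described above can be sketched as a small predicate. This is only an illustration of the documented rules, not the provider's actual implementation; the names TopicFilter and passesTopicFilter are hypothetical, and treating an empty include list as "not configured" is an assumption.

```typescript
// Hypothetical types and helper illustrating the documented topic rules.
interface TopicFilter {
  include?: string[];
  exclude?: string[];
}

function passesTopicFilter(
  repoTopics: string[],
  filter: TopicFilter = {},
): boolean {
  const { include, exclude } = filter;
  // The exclusion filter has the highest priority.
  if (exclude && exclude.some(topic => repoTopics.includes(topic))) {
    return false;
  }
  // Assumption: an empty include list counts as "not configured".
  if (include && include.length > 0) {
    return include.some(topic => repoTopics.includes(topic));
  }
  // No inclusion filter configured: everything not excluded is ingested.
  return true;
}
```

With the topicFilterProviderId example from the configuration, a repository tagged with both backstage-include and experiments is rejected, while one tagged only with backstage-include is accepted.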
GitHub API Rate Limits
GitHub rate limits API requests to 5,000 per hour (or more for Enterprise accounts). The snippet below refreshes the Backstage catalog data every 35 minutes, which issues an API request for each discovered location.
If your requests are too frequent then you may get throttled by rate limiting. You can change the refresh frequency of the catalog in your app-config.yaml file by controlling the schedule.
schedule:
  frequency: { minutes: 35 }
  timeout: { minutes: 3 }
More information about scheduling can be found on the SchedulerServiceTaskScheduleDefinition page.
Alternatively, or additionally, you can configure github-apps authentication which carries a much higher rate limit at GitHub.
This is true for any method of adding GitHub entities to the catalog, but especially easy to hit with automatic discovery.
GitHub Processor (To Be Deprecated)
The GitHub integration has a special discovery processor for discovering catalog entities within a GitHub organization. The processor will crawl the GitHub organization and register entities matching the configured path. This can be useful as an alternative to static locations or manually adding things to the catalog.
Installation
You will have to add the processors in the catalog initialization code of your backend. They are not installed by default, therefore you have to add a dependency on @backstage/plugin-catalog-backend-module-github to your backend package, plus @backstage/integration for the basic credentials management:
yarn --cwd packages/backend add @backstage/integration @backstage/plugin-catalog-backend-module-github
And then add the processors to your catalog builder:
import {
  GithubDiscoveryProcessor,
  GithubOrgReaderProcessor,
} from '@backstage/plugin-catalog-backend-module-github';
import {
  ScmIntegrations,
  DefaultGithubCredentialsProvider,
} from '@backstage/integration';
export default async function createPlugin(
  env: PluginEnvironment,
): Promise<Router> {
  const builder = await CatalogBuilder.create(env);
  const integrations = ScmIntegrations.fromConfig(env.config);
  const githubCredentialsProvider =
    DefaultGithubCredentialsProvider.fromIntegrations(integrations);
  builder.addProcessor(
    GithubDiscoveryProcessor.fromConfig(env.config, {
      logger: env.logger,
      githubCredentialsProvider,
    }),
    GithubOrgReaderProcessor.fromConfig(env.config, {
      logger: env.logger,
      githubCredentialsProvider,
    }),
  );
  // ..
}
Configuration
To use the discovery processor, you'll need a GitHub integration set up with either a Personal Access Token or GitHub Apps.
Then you can add a location target to the catalog configuration:
catalog:
  locations:
    # (since 0.13.5) Scan all repositories for a catalog-info.yaml in the root of the default branch
    - type: github-discovery
      target: https://github.com/myorg
    # Or use a custom pattern for a subset of all repositories, on the default branch
    - type: github-discovery
      target: https://github.com/myorg/service-*/blob/-/catalog-info.yaml
    # Or use a custom file format and location
    - type: github-discovery
      target: https://github.com/myorg/*/blob/-/docs/your-own-format.yaml
    # Or use a specific branch-name
    - type: github-discovery
      target: https://github.com/myorg/*/blob/backstage-docs/catalog-info.yaml
Note the github-discovery type, as this is not a regular url processor.
When using a custom pattern, the target is composed of three parts:
- The base organization URL, https://github.com/myorg in this case.
- The repository blob to scan, which accepts * wildcard tokens. This can simply be * to scan all repositories in the organization. This example only looks for repositories prefixed with service-.
- The path within each repository to find the catalog YAML file. This will usually be /blob/main/catalog-info.yaml, /blob/master/catalog-info.yaml or a similar variation for catalog files stored in the root directory of each repository. You could also use a dash (-) for referring to the default branch.
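To make the three-part structure concrete, here is a minimal sketch that splits a target into its parts. It is not the processor's actual parser; parseDiscoveryTarget and DiscoveryTarget are hypothetical names, and the defaults applied to an organization-only target (all repositories, default branch, /catalog-info.yaml) are taken from the comment on the first configuration example.

```typescript
// Hypothetical shape for the parts of a github-discovery target.
interface DiscoveryTarget {
  host: string;
  org: string;
  repoPattern: string; // may contain * wildcard tokens
  branch: string; // '-' refers to the repository's default branch
  catalogPath: string;
}

function parseDiscoveryTarget(target: string): DiscoveryTarget {
  const url = new URL(target);
  const segments = url.pathname
    .split('/')
    .filter(segment => segment.length > 0)
    .map(segment => decodeURIComponent(segment));

  // Organization-only form, e.g. https://github.com/myorg
  if (segments.length === 1) {
    return {
      host: url.host,
      org: segments[0],
      repoPattern: '*',
      branch: '-',
      catalogPath: '/catalog-info.yaml',
    };
  }

  // Full form: /<org>/<repo pattern>/blob/<branch>/<path to file>
  if (segments.length >= 5 && segments[2] === 'blob') {
    const [org, repoPattern, , branch, ...pathParts] = segments;
    return {
      host: url.host,
      org,
      repoPattern,
      branch,
      catalogPath: '/' + pathParts.join('/'),
    };
  }

  throw new Error(`Unsupported github-discovery target: ${target}`);
}
```

For https://github.com/myorg/service-*/blob/-/catalog-info.yaml this yields the organization myorg, the repository pattern service-*, the default branch marker -, and the path /catalog-info.yaml.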
GitHub API Rate Limits
GitHub rate limits API requests to 5,000 per hour (or more for Enterprise accounts). The default Backstage catalog backend refreshes data every 100 seconds, which issues an API request for each discovered location.
This means if you have more than ~140 catalog entities, you may get throttled by rate limiting. You can change the refresh rate of the catalog in your packages/backend/src/plugins/catalog.ts file:
const builder = await CatalogBuilder.create(env);
// For example, to refresh every 5 minutes (300 seconds).
builder.setProcessingIntervalSeconds(300);
Alternatively, or additionally, you can configure github-apps authentication which carries a much higher rate limit at GitHub.
This is true for any method of adding GitHub entities to the catalog, but especially easy to hit with automatic discovery.
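The ~140 figure quoted in this section follows from simple arithmetic: a 100-second interval means 36 refreshes per hour, and 5,000 / 36 is roughly 138. The sketch below reproduces that estimate, assuming (as stated above) one API request per discovered location per refresh and no other traffic on the same token. The function name maxLocationsWithinRateLimit is hypothetical.

```typescript
// Hypothetical helper: estimates how many locations can be refreshed
// without exceeding an hourly request budget.
// Assumption: each refresh issues exactly one API request per location.
function maxLocationsWithinRateLimit(
  refreshIntervalSeconds: number,
  requestsPerHour: number = 5000,
): number {
  const refreshesPerHour = 3600 / refreshIntervalSeconds;
  return Math.floor(requestsPerHour / refreshesPerHour);
}

const atDefaultInterval = maxLocationsWithinRateLimit(100); // 138
const atFiveMinutes = maxLocationsWithinRateLimit(300); // 416
```

Raising the interval to 300 seconds, as in the setProcessingIntervalSeconds example, lifts the estimate from about 138 to about 416 locations under the same assumptions.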