From zero to type-safe: How we brought static type checking to a large-scale Python codebase

We introduced static type checking into a rapidly growing, 4M+ line Python monorepo with daily production deployments.

Using a combination of gradual typing, automated type inference powered by MonkeyType, a purpose-built mypy configuration system, and a two-tier CI integration, we went from no static analysis to blocking type errors at the pull request level, without requiring a big-bang annotation effort from every team.

The Challenge: Scaling Confidence Alongside Code

Large Python codebases have a well-known tension: the language’s dynamic nature makes it fast to write but easy to introduce subtle contract violations between modules. A function that returns None in an edge case, a string passed where an integer is expected, a missing attribute on an unexpected type: these bugs are invisible until runtime.

Consider a simple example:

def get_user_role(user_id):
    user = load_user(user_id)
    if not user:
        return None
    return user.role

Two bugs lurk here that no unit test within this function would catch: load_user might expect an int but callers pass a str, and downstream code doing "Role: " + get_user_role(...) will crash when the function returns None. These are interface-level bugs: mismatches between what a function promises and what its callers assume.
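With annotations in place, both failures surface at CI time instead of in production. Here is a minimal sketch of the typed version (the User class and load_user signature are illustrative assumptions, and the mypy messages are paraphrased):

from dataclasses import dataclass

@dataclass
class User:
    role: str

def load_user(user_id: int) -> User | None:  # assumed signature
    ...

def get_user_role(user_id: int) -> str | None:
    user = load_user(user_id)
    if not user:
        return None
    return user.role

greeting = "Role: " + get_user_role(7)  # mypy: unsupported operand types for + ("str" and "None")
role = get_user_role("7")               # mypy: incompatible type "str"; expected "int"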

As our Python backend grew past 4M lines, the surface area for these kinds of issues grew combinatorially. Unit tests check behavior within a function’s contract. Linters catch surface-level issues in the AST. But neither traces flows of data across module boundaries to verify that callers and callees agree on types.

Static type checkers like mypy fill exactly this gap. They parse function contracts (type annotations), resolve call dependencies across the codebase, and verify consistency up and down the call graph, all at CI time, with zero runtime cost. The challenge wasn’t whether to adopt type checking, but how to introduce it into a large, fast-moving codebase without grinding development to a halt.

Why Not Just “Add Types Everywhere”?

The naive approach (annotate the entire codebase, flip on strict mode, fix all errors) doesn’t work at this scale. Instagram’s experience illustrates the difficulty: when they attempted to incorporate types into their million-line Python codebase, it took approximately 8 months to reach 50% coverage. Manually annotating a large legacy codebase is a draining process, much of it spent hand-annotating functions whose types a tool could have inferred.

Other teams tackling this problem at scale have relied on dedicated typing teams to annotate critical modules top-down, driving adoption through whole-file annotation pushes or version-gated migration sprints. These approaches work when you can concentrate resources on the problem. Our constraints were different — we couldn’t staff a dedicated typing team, and we needed coverage to grow as a byproduct of normal engineering work. This led to the two-tier configuration system and the diff-aware CI pre-check: mechanisms where every PR makes the untyped surface a little smaller, without anyone scheduling a “typing sprint.”

We needed a strategy that was:

  1. Incremental: start small, expand over time, never block the entire organization.
  2. Automated where possible: minimize the manual annotation burden.
  3. Enforceable from day one: even partial type coverage should catch regressions in CI.

Our approach rests on three pillars: a carefully designed two-tier mypy configuration, automated type inference from production traffic (powered by MonkeyType and autotyping) to bootstrap annotations, and CI-level enforcement that raises the bar with every pull request.

Pillar 1: Designing the Mypy Configuration

Mypy is highly configurable, with over 50 flags controlling what it checks and how strictly. Getting the configuration right was critical: too strict and we’d drown in errors from legacy code; too lenient and the type checker becomes toothless.

The Two-Tier System: Strict Mode and Lenient Mode

We designed a two-tier system to serve two fundamentally different use cases.

Strict Mode is for new projects and fully-typed modules. It demands 100% typed code: CI blocks any untyped function. To onboard a module, an engineer adds its path to the strict config, runs mypy locally to confirm zero errors, and merges. From that point forward, no untyped code can enter the module.

Lenient Mode is for legacy modules being gradually typed, and it’s the workhorse of the entire system. In a codebase with millions of existing lines, the vast majority of modules start here. Lenient mode makes a critical compromise: existing untyped code is tolerated by mypy (since disallow_untyped_defs is False), but a separate CI check (the AddedTypesTest, described in detail in Pillar 3) ensures that every new function added in a pull request is fully annotated. This creates a ratchet effect: the untyped surface can only shrink. To onboard, an engineer adds the module to the lenient config and resolves any existing mypy errors from partially-typed code. Once the module reaches 100% annotation coverage, it can be promoted to strict mode.

The key flags that differ between the tiers reflect this philosophy. In lenient mode, disallow_untyped_defs, disallow_incomplete_defs, and disallow_any_generics are all False, lowering the barrier for teams to opt in without fixing every legacy function. But ignore_errors is overridden to False (just like strict mode), so mypy does analyze the typed functions that exist, catching real type errors in annotated code even while untyped functions are silently skipped.

|  | Strict Mode | Lenient Mode |
| --- | --- | --- |
| Who uses it | New projects and fully-typed modules | Legacy modules being gradually typed |
| Core guarantee | 100% typed code. CI blocks any untyped function. | New/modified code must be typed; existing untyped code is tolerated. |
| Mypy behavior | Rejects any function without full annotations. | Checks only annotated functions; skips untyped ones. |
| New code enforcement | By mypy directly (disallow_untyped_defs=True). | By a separate AddedTypesTest CI check that parses the PR diff. |
| Typical onboarding | Add path to strict config, run mypy locally, fix any errors, merge. | Add path to lenient config, resolve existing errors, merge. |
| Progression | Terminal state: full type safety. | Intermediate state: promote to strict once fully annotated. |

Both tiers share a common configuration foundation but diverge on key flags. We evaluated every mypy flag and made deliberate choices for each tier. Here are the decisions that mattered most:

Untyped Definitions and Calls

| Flag | Strict | Lenient | Rationale |
| --- | --- | --- | --- |
| disallow_untyped_defs | True | False | Strict mode demands 100% coverage. Lenient mode allows gradual adoption. Our custom AddedTypesTest separately enforces that new functions are typed. |
| disallow_incomplete_defs | True | False | Partially annotated functions (e.g. annotated return but unannotated args) are rejected in strict mode. |
| disallow_untyped_calls | False | False | Even in strict mode, fully-typed modules often call utility functions that aren’t yet annotated. Setting this to True would break typed modules that depend on untyped ones, exactly the kind of cross-module friction we wanted to avoid during gradual adoption. |
| check_untyped_defs | False | False | In strict mode, there are no untyped defs to check. In lenient mode, we want developers to add types progressively rather than having mypy analyze untyped function bodies. |
| disallow_untyped_decorators | False | False | Many widely-used decorators in the codebase aren’t typed yet. Enabling this would cause errors in typed functions simply for using a common decorator. |

Generics and Dynamic Typing

| Flag | Strict | Lenient | Rationale |
| --- | --- | --- | --- |
| disallow_any_generics | True | False | In strict mode, writing bare list instead of list[str] is an error. In lenient mode, we relax this to lower the barrier to entry. The trade-off: strict mode requires list[Any] when the element type is truly unknown. |
| disallow_any_expr | False | False | Not even mypy’s own strict mode enables this: Any is unavoidable in a large codebase with third-party libraries and untyped dependencies, and enabling this would generate a flood of noise. |
| disallow_subclassing_any | True | True | Subclassing an untyped class is rare and usually a sign of trouble. Safe to enable in both tiers. |

Warnings and Return Types

| Flag | Strict | Lenient | Rationale |
| --- | --- | --- | --- |
| warn_return_any | False | False | This warns when returning Any from a function with a concrete return type. Sounds useful, but in practice it triggers whenever a typed function calls an untyped utility. We’ll enable this once coverage is much higher. |
| warn_unreachable | True | False | Dead code detection is valuable in strict modules. For lenient modules still being annotated, it generates too many trivial warnings. |
| warn_redundant_casts | True | True | Low noise, catches real issues. |
| warn_unused_ignores | True | True | Ensures # type: ignore comments don’t linger after the underlying issue is fixed. |

Error Suppression and Missing Imports

| Flag | Strict | Lenient | Rationale |
| --- | --- | --- | --- |
| ignore_errors (global) | True | True | Set to True at the global level so that modules not opted into either tier produce zero noise. Individual strict/lenient sections override this to False. |
| ignore_missing_imports | True | True | Without this, mypy complains about every third-party library lacking type stubs. We suppress these globally and use disable_error_code: import-untyped to avoid requiring manual stub installation. |

Configuration as Code

With the two tiers defined above, the entire configuration lives in a single JSON file. The JSON carries four things: a global_config with safe defaults for the entire codebase, a strict_mode_config and lenient_mode_config with per-tier flag overrides, and the lists of paths and modules enrolled in each tier:

{
  "enabled": true,
  "global_config": {
    "exclude": ".*/test_.*\\.py$",
    "plugins": "pydantic.mypy",
    "ignore_errors": "True",
    "ignore_missing_imports": "True",
    "disable_error_code": "import-untyped"
  },

  "strict_mode_config": {
    "ignore_errors": "False",
    "disallow_untyped_defs": "True",
    "disallow_incomplete_defs": "True",
    "disallow_any_generics": "True",
    "disallow_untyped_calls": "False",
    "no_implicit_reexport": "True",
    "warn_return_any": "False"
  },

  "lenient_mode_config": {
    "ignore_errors": "False",
    "disallow_untyped_defs": "False",
    "disallow_incomplete_defs": "False",
    "disallow_any_generics": "False",
    "disallow_untyped_calls": "False",
    "no_implicit_reexport": "True",
    "warn_return_any": "False"
  },

  "strict_mode_files": {
    "paths": ["src/billing/", "src/notifications/"],
    "modules": ["billing.*", "notifications.*"]
  },

  "lenient_mode_files": {
    "paths": ["src/user_profiles/", "src/search/"],
    "modules": ["user_profiles.*", "search.*"]
  }
}

From JSON to .mypy.ini: How the Config is Parsed

At CI time, a config parser reads this JSON and dynamically builds a standard .mypy.ini using Python’s ConfigParser. The transformation is deliberate:

  1. The [mypy] global section merges paths from both tiers into a single files list (so mypy knows what to scan) and applies the global_config flags. Critically, the global section sets ignore_errors = True, which means any module not explicitly enrolled in a tier produces zero noise.
  2. A [mypy-<lenient_modules>] section is created by joining all lenient module globs with commas (e.g. [mypy-user_profiles.*,search.*]). This section applies the lenient flag overrides, most importantly ignore_errors = False (so typed code is checked) and disallow_untyped_defs = False (so legacy untyped functions don’t produce errors).
  3. A [mypy-<strict_modules>] section is created similarly for strict modules, applying the full strictness flags.

The resulting .mypy.ini looks like this:

[mypy]
files = /repo/src/billing/,/repo/src/notifications/,/repo/src/user_profiles/,/repo/src/search/
exclude = .*/test_.*\.py$
plugins = pydantic.mypy
ignore_errors = True
ignore_missing_imports = True
disable_error_code = import-untyped
[mypy-billing.*,notifications.*]
ignore_errors = False
disallow_untyped_defs = True
disallow_incomplete_defs = True
disallow_any_generics = True
... 
[mypy-user_profiles.*,search.*]
ignore_errors = False
disallow_untyped_defs = False
disallow_incomplete_defs = False
disallow_any_generics = False
...

This structure is the reason lenient mode works: mypy runs across all enrolled modules, but its behavior diverges per section. For lenient-mode modules, mypy will verify type consistency in functions that have annotations, but will silently skip functions that don’t. The AddedTypesTest (described next) fills the gap that mypy’s lenient config intentionally leaves open.

Pillar 2: Learning Types from Production with Automated Tracing

Adding type annotations manually to a vast legacy codebase is tedious, especially for functions with easily inferable types. We invested in three levels of automation, each handling a progressively harder class of functions.

Level 1: Autotyping for the Obvious Cases

Many functions have types that are trivially inferable from context:

def get_greeting():
    return "hello"

It’s obvious this returns a str. The autotyping library handles these cases automatically. Running it in --safe mode across a module adds return type annotations for functions where the return type is unambiguous from the source code alone. This is the lowest-effort, highest-confidence first pass: pure static analysis, no runtime data required.
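Running that pass over the example above yields the annotation you’d expect:

def get_greeting() -> str:
    return "hello"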

Level 2: MonkeyType for Runtime-Observed Types

For functions where the types aren’t statically obvious, we need runtime information. MonkeyType, an open-source tool developed by Instagram, was a critical piece of our automation stack.

MonkeyType uses Python’s sys.setprofile hook to intercept function calls, returns, and generator yields at runtime, recording the concrete types of every argument and return value. By default it stores traces in a SQLite database, and can then generate type stubs or apply annotations directly to source files using LibCST. The workflow is straightforward:

# Collect type information by running your code
monkeytype run myscript.py

# Generate a type stub for a module
monkeytype stub mymodule

# Or apply annotations directly to the source
monkeytype apply mymodule

MonkeyType was invaluable for bootstrapping annotations across modules autotyping couldn’t reach: functions whose return types depend on input values, methods on complex class hierarchies, and code with conditional branches that aren’t trivially analyzable from the AST.

However, MonkeyType’s default architecture (local execution, SQLite storage) doesn’t scale to a production environment serving millions of requests across multiple regions. We needed a way to collect runtime types from production traffic at scale.

Level 3: Scaling MonkeyType to Production with Custom Type Tracing

To bridge MonkeyType’s capabilities with production-scale requirements, we built a custom Type Tracer framework. The core insight: MonkeyType’s architecture is pluggable. Its CallTraceStore interface allows custom storage backends, and we exploited this to replace the default SQLite store with a distributed pipeline backed by streaming infrastructure and cloud object storage.
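A minimal sketch of that seam, assuming a hypothetical ObjectStorageTraceStore (the production store is considerably more involved):

from typing import Iterable, List, Optional

from monkeytype.config import DefaultConfig
from monkeytype.db.base import CallTraceStore, CallTraceThunk
from monkeytype.tracing import CallTrace

class ObjectStorageTraceStore(CallTraceStore):
    """Hypothetical backend that reads traces aggregated in object storage."""

    def add(self, traces: Iterable[CallTrace]) -> None:
        # Production traces arrive through the streaming pipeline, not this path.
        raise NotImplementedError

    def filter(
        self, module: str, qualname_prefix: Optional[str] = None, limit: int = 2000
    ) -> List[CallTraceThunk]:
        # Fetch and deduplicate the traces recorded for `module` (elided).
        ...

class TracerConfig(DefaultConfig):
    def trace_store(self) -> CallTraceStore:
        return ObjectStorageTraceStore()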

The framework collects function argument types and return types at runtime, aggregates them across regions, and generates type stubs that developers can review and apply. We can break the pipeline into three stages:

Tracing. A @trace_types decorator records the inbound and outbound types of a function during production request handling. Calls are sampled at a configurable rate (typically 1-2%) to minimize latency impact. On every request teardown, the collected traces are flushed to a streaming pipeline.
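What the @trace_types decorator does can be sketched roughly like this (names and sampling logic are illustrative; we assume MonkeyType’s monkeytype.trace() context manager accepts our config):

import functools
import random

import monkeytype

def trace_types(sample_rate: float = 0.01):
    """Illustrative sampling decorator: trace roughly sample_rate of calls."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < sample_rate:
                # Record argument/return types for this sampled call.
                with monkeytype.trace(TracerConfig()):  # TracerConfig from above
                    return fn(*args, **kwargs)
            return fn(*args, **kwargs)
        return wrapper
    return decorator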

Storing. The streamed traces land in object storage, partitioned by module and region. Deduplication happens at write time (within a request) and at read time (when aggregating across regions), keeping storage costs manageable.

Stubbing. When a developer is ready to type a module, they run MonkeyType’s stub generator with our custom storage config:

monkeytype --config=custom_tracer:TracerConfig() stub myapp.core_module

This pulls traces from object storage, unifies the observed types, and produces a typed stub file. The developer reviews the stub, refines it (especially for generic patterns), and merges the annotations.

For example, given this function:

def get_lastname(profile_obj):
    return profile_obj.lastname

The tracer observes in production that profile_obj is always a Profile instance and the return value is always a str. MonkeyType generates:

def get_lastname(profile_obj: Profile) -> str:
    return profile_obj.lastname

The Limits of Automation

Neither MonkeyType nor any runtime tracing tool understands generics. A utility function that works on any list type will produce a union of every concrete type seen in production:

# What MonkeyType / the tracer generates:
def safe_get(
    lst: list[int | str | dict | SomeClass | ...],
    idx: int,
    default: int | str | dict | None | ...
) -> int | str | dict | SomeClass | None | ...: ...

# What a human knows is correct:
from typing import TypeVar

T = TypeVar("T")

def safe_get(lst: list[T], idx: int, default: T) -> T: ...

We treat the automation pipeline as a bootstrapping tool: it generates a first draft that developers review and refine, particularly for generic code and complex control flow patterns. The progression is: run autotyping for the trivial cases, run MonkeyType (via the Type Tracer) for the runtime-inferable cases, then apply human judgment for generics and complex patterns.

Pillar 3: CI Enforcement — The Two-Layer Gate for Lenient Mode

With annotations coming in from autotyping, MonkeyType-powered production tracing, and manual effort, the CI pipeline enforces correctness and prevents regression. This is where the lenient mode design really comes together.

For strict-mode modules, enforcement is simple: mypy itself blocks any untyped function (disallow_untyped_defs = True). But for lenient-mode modules, mypy intentionally allows untyped functions to exist. So how do we ensure that new code is always typed? The answer is a two-layer enforcement architecture: a custom pre-check runs before mypy, targeting only lenient-mode files, and catches unannotated new functions that mypy is configured to ignore.

Layer 1: The AddedTypesTest — Closing the Lenient Mode Gap

The AddedTypesTest is a purpose-built CI check that uses LibCST to parse the PR diff and enforce one rule: any newly added function in a lenient-mode module must be fully annotated. It runs before mypy and fails fast, giving developers immediate feedback.

Here’s how it works, step by step:

Step 1: Filter to lenient-mode files only. The test takes the list of Python files changed in the PR and filters it down to only those that fall under a lenient-mode path. It reads the lenient_mode_files.paths list from the mypy config JSON and matches each changed file against those paths:

def get_files_to_scan_for_lenient_mode(self, config) -> list[str]:
    lenient_mode_paths = config.get_lenient_mode_paths()
    files_to_scan = []
    for file_path in self.py_files_changed:
        for path in lenient_mode_paths:
            if path_matches_pattern(file_path, path):
                files_to_scan.append(file_path)
                break
    return files_to_scan

Files under strict-mode paths are skipped here because mypy itself handles enforcement for those. Files not enrolled in either tier are also skipped. This filter is what makes the AddedTypesTest exclusively a lenient-mode concern.

Step 2: Extract added lines from the PR patch. For each lenient-mode file, the test parses the PR’s unified diff to identify exactly which line numbers were added (not removed or unchanged). This comes from the GitHub API’s patch data, with a fallback to local git diff for large PRs that exceed GitHub’s response size limits.
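A compact sketch of that extraction, assuming the unified diff format GitHub returns in its patch field:

import re

HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")

def added_line_numbers(patch: str) -> set[int]:
    """Return the new-file line numbers of every added line in a unified diff."""
    added: set[int] = set()
    new_line = 0
    for line in patch.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            new_line = int(header.group(1))  # start line in the new file
        elif line.startswith("+"):
            added.add(new_line)
            new_line += 1
        elif line.startswith("\\"):  # "\ No newline at end of file" marker
            continue
        elif not line.startswith("-"):  # context lines advance the new-file counter
            new_line += 1
    return added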

Step 3: Parse the file with LibCST to collect function metadata. A TypeCollector CST visitor walks the parsed file and records metadata for every function definition: its parameters, return type annotation (if any), and its start and end line numbers in the source:

import libcst as cst
from libcst.metadata import PositionProvider

class TypeCollector(cst.CSTVisitor):
    METADATA_DEPENDENCIES = (PositionProvider,)

    def __init__(self) -> None:
        self.stack: list[str] = []  # tracks class/function nesting
        self.functions: dict[tuple[str, ...], FunctionMetadata] = {}

    def visit_FunctionDef(self, node: cst.FunctionDef) -> bool | None:
        self.stack.append(node.name.value)
        position = self.get_metadata(PositionProvider, node)
        self.functions[tuple(self.stack)] = FunctionMetadata(
            parameters=node.params,
            return_type=node.returns,
            start_line=position.start.line,
            end_line=position.end.line,
        )
        return None

The visitor also tracks class nesting via a stack, so a method Bar.process is recorded as ("Bar", "process") rather than just "process".
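That bookkeeping presumably pairs each visit with a leave hook that pops the stack, along these lines:

# Inside TypeCollector:
def visit_ClassDef(self, node: cst.ClassDef) -> bool | None:
    self.stack.append(node.name.value)
    return None

def leave_ClassDef(self, original_node: cst.ClassDef) -> None:
    self.stack.pop()

def leave_FunctionDef(self, original_node: cst.FunctionDef) -> None:
    self.stack.pop()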

Step 4: Identify unannotated functions in newly added code. The test intersects the two data sets: unannotated functions (from LibCST) and added lines (from the diff). A function is considered unannotated if any parameter (other than self) lacks a type annotation or the return type is missing:

def _is_annotated(self, function: FunctionMetadata) -> bool:
    for argument in function.parameters.params:
        if argument.name.value == "self":
            continue
        if argument.annotation is None:
            return False
    return function.return_type is not None
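The companion "newly added" check then intersects the collected line ranges with the diff data; a sketch (added_lines is the set produced in Step 2):

def _is_newly_added(self, function: FunctionMetadata, added_lines: set[int]) -> bool:
    # Every line of the function must appear among the PR's added lines.
    return all(
        line in added_lines
        for line in range(function.start_line, function.end_line + 1)
    )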

If an unannotated function’s line range falls entirely within the PR’s added lines, it’s a newly added function that should have been typed, and the test flags it. For example, if a developer adds this function at lines 42-47 of a lenient-mode file:

def process_application(user_id, job_id):   # line 42, no annotations!
    user = load_user(user_id)                # line 43
    job = load_job(job_id)                   # line 44
    if not user or not job:                  # line 45
        return None                          # line 46
    return apply(user, job)                  # line 47

The AddedTypesTest detects that lines 42-47 are all newly added, the function lacks annotations, and fails the PR with a message pointing to path/to/file.py::process_application. The developer must add types before merging:

def process_application(user_id: int, job_id: int) -> Application | None:
    ...

Step 5: Fail fast or pass. If any unannotated new functions are found, the test short-circuits CI with a failure message listing every offending function. If all new functions are typed, the test passes and CI proceeds to run mypy.

Layer 2: Mypy with Lenient Configuration

After the AddedTypesTest passes, mypy runs with the generated .mypy.ini. For lenient-mode modules, mypy’s job is different from strict mode:

  • It checks type consistency in annotated functions: if a function has annotations, mypy verifies that the body matches the declared types, that callers pass the right argument types, and that return values are consistent.
  • It silently skips unannotated functions: because disallow_untyped_defs = False, mypy doesn’t enter unannotated function bodies at all.
  • It still catches structural errors: even in lenient mode, mypy catches issues like accessing a non-existent attribute on a typed variable, returning the wrong type from an annotated function, or passing incompatible types to an annotated callee.

This two-layer design means lenient-mode modules get the best of both worlds: forward progress (Layer 1 ensures new code is always typed) and correctness (Layer 2 ensures typed code is actually valid). Neither layer alone would suffice. Mypy alone would happily accept new untyped functions. The AddedTypesTest alone wouldn’t catch type errors in the annotations that do exist.

The Full CI Flow

PR opened / updated
    │
    ▼
CI collects changed Python files (excluding tests)
    │
    ▼
Step 1: AddedTypesTest  [lenient-mode only]
    │  ├─ Filter changed files to lenient-mode paths
    │  ├─ Parse PR diff to find added line numbers
    │  ├─ Parse each file with LibCST to find function definitions
    │  ├─ Flag unannotated functions whose entire body is newly added
    │  └─ Fail fast if any found
    │
    ▼
Step 2: Generate .mypy.ini from config JSON
    │  ├─ [mypy] global: ignore_errors=True (uncovered modules → no noise)
    │  ├─ [mypy-strict_modules]: full enforcement (disallow_untyped_defs=True)
    │  └─ [mypy-lenient_modules]: relaxed rules (disallow_untyped_defs=False)
    │
    ▼
Step 3: Run mypy
    │  ├─ Strict modules: checked with full strictness
    │  ├─ Lenient modules: only annotated functions checked
    │  └─ Unenrolled modules: silently skipped
    │
    ▼
Mandatory check — blocks merge on failure. No exceptions, no overrides.

The Onboarding Experience

We designed the system so that any team can opt in with minimal friction.

Starting a New Project (Strict Mode)

  1. Create the project directory on the main branch.
  2. Add the path and module to the strict_mode_files section of the mypy config JSON.
  3. Run mypy locally to verify zero errors.
  4. Merge the config change. From this point on, no untyped code can enter the module.

Adding Types to a Legacy Module (Lenient Mode)

This is the path most teams take, and it’s designed to be as low-friction as possible.

Phase 1: Opt in. Add the module’s path and module glob to the lenient_mode_files section of the mypy config JSON:

"lenient_mode_files": 
    "paths": [
        ...,
        "src/legacy_module"
    ],
    "modules": [
        ...,
        "legacy_module.*"
    ]
}

Run mypy locally to check for existing errors. If the module is 0% typed, this produces zero errors since mypy skips unannotated functions. If the module is partially typed (some functions have annotations), mypy may flag type inconsistencies in those functions. Resolve any errors and merge the config change. From this point forward, the two-layer CI gate is active: the AddedTypesTest will require annotations on every new function, and mypy will verify correctness of annotated code.

Phase 2: Autotype the easy wins. Run autotyping in safe mode to annotate the trivially-inferable functions (functions that return string literals, boolean constants, None, etc.):

python -m autotyping --safe src/legacy_module

Review the changes, create a PR, and merge.

Phase 3: Trace production types. Enable the Type Tracer (powered by MonkeyType) to collect runtime types from production traffic. After letting it run for a few days, generate stubs and review:

monkeytype --config=custom_tracer:TracerConfig() stub legacy_module.core

This produces a first draft for the more complex functions. Review the stubs for accuracy, particularly for generic utility functions where MonkeyType tends to produce overly-specific union types.

Phase 4: Promote to strict. Once the module reaches 100% annotation coverage, move its entry from lenient_mode_files to strict_mode_files in the config JSON. This is a one-line change that permanently raises the bar: mypy will now reject any untyped function, and the AddedTypesTest no longer applies (it’s redundant when mypy itself enforces full coverage).

The Progression Path

The lifecycle is clear: Untracked -> Lenient -> Strict. Each transition is a small, reviewable config change. Teams can move at their own pace, and the system guarantees that things only get better: lenient mode ensures coverage increases with every PR, and the promotion to strict mode locks in complete type safety.

The Gradual Typing Philosophy

The key insight behind this approach is that partial type checking is still valuable. This is the essence of gradual typing: the type checker analyzes annotated code and silently skips unannotated code.

def get_firstname(profile_id: str) -> str:  # mypy checks this
    profile = get_profile(profile_id)
    return profile.firstname

def get_lastname(profile_id):  # no annotations, mypy skips this
    profile = get_profile(profile_id)
    return profile.lastname

Mypy will verify get_firstname for consistency but won’t flag get_lastname. This lets teams annotate at their own pace. But there’s a compounding effect: the more functions that are annotated, the more call-site mismatches the checker can find. We deliberately started annotation efforts from the most heavily-imported utility modules, identified through dependency graph analysis, to maximize this ripple effect. Annotating high-fan-out modules first means every downstream consumer immediately gets better type checking, even before adding their own annotations.

Lessons Learned

  1. Design your mypy config deliberately. Don’t just use --strict. Mypy’s built-in --strict flag enables everything at once, including flags like disallow_untyped_calls and warn_return_any that generate enormous noise in a partially-typed codebase. We evaluated every flag individually and made pragmatic choices for each tier. The config is a living document that gets stricter as coverage grows.
  2. Config-driven CI enforcement scales better than code review. Asking engineers to “please add types” in code review doesn’t scale. Having CI block the merge does. A data-driven config file makes it trivial to onboard new modules and impossible to backslide.
  3. Lenient mode’s “new code must be typed” rule is surprisingly effective. The AddedTypesTest enforces annotations only on newly added functions in lenient-mode modules, meaning the codebase gets more typed with every PR, without anyone needing to schedule dedicated “typing sprints.” It’s type safety through natural attrition. Combined with mypy checking the annotated functions for correctness, lenient mode delivers genuine value from day one, long before a module reaches 100% coverage.
  4. Start from the leaves of your dependency tree. Annotating high-level orchestration code first gives you very little. The type checker can’t validate calls to unannotated dependencies. Start from the bottom (utility modules, data models) and work upward.
  5. MonkeyType is powerful but needs human review. MonkeyType and our production Type Tracer handle the vast majority of non-trivial annotations automatically. But runtime tracing produces concrete types, not abstract ones. Generic utility code, *args/**kwargs patterns, and complex control flow all require human judgment. Treat automated annotations as a first draft, not a final answer.
  6. Keep disallow_untyped_calls off until coverage is very high. This flag causes the most friction in a gradually-typed codebase. A fully-typed module calling an untyped utility function shouldn’t produce an error. That punishes the team that typed their code first. We’ll enable it once our most-imported modules are fully annotated.
  7. Adopt from __future__ import annotations everywhere. This makes all annotations lazily evaluated strings, eliminating circular import issues from type-only imports and enabling use of modern syntax (like X | Y unions) on older Python versions. Combined with typing.TYPE_CHECKING guards for type-only imports (see the sketch after this list), it keeps the runtime footprint of type annotations at zero.
  8. AI-assisted typing is a complement, not a replacement. LLM-based coding tools can generate plausible type annotations, and we expect them to play an increasing role in bootstrapping coverage. But LLM-inferred types are educated guesses — they lack the ground truth of production runtime data that our MonkeyType-powered tracing provides. More fundamentally, the challenge this system addresses isn’t just generating annotations: it’s governing them. Gradually rolling out type enforcement across a 4M-line codebase, preventing regressions in typed modules, and managing per-module strictness through configuration are organizational and infrastructure problems that remain regardless of how annotations are authored. AI tools accelerate the annotation step; the two-tier config, diff-aware CI pre-check, and production tracing pipeline provide the framework that makes those annotations trustworthy at scale.
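To make lesson 7 concrete, the pattern looks like this in miniature (module and class names are hypothetical):

from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only while type checking; at runtime this import never executes,
    # so it cannot create a circular dependency.
    from myapp.models import Profile  # hypothetical module

def get_lastname(profile_obj: Profile) -> str:
    return profile_obj.lastname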

Where We Stand

Today, we have a growing number of modules under strict mode with full type safety enforcement, and more in lenient mode where every new function must be annotated. MonkeyType-powered production type tracing runs continuously, generating annotation suggestions for modules as they come up for review. Every pull request that touches Python code passes through our mypy gate: mandatory, no exceptions.

The system is designed to ratchet forward. Every week, more modules cross the threshold from lenient to strict. Every PR that adds new code makes the untyped surface a little smaller.

Type checking a large Python codebase isn’t a switch you flip. It’s a gradient you push, and the tooling and configuration you build determine how fast you can push it.
