Agent Skill
2/7/2026

bio-genome-annotation-functional-annotation

Assign GO terms, KEGG orthologs, Pfam domains, and EC numbers to predicted proteins using eggNOG-mapper and InterProScan. Produces functional summaries for downstream pathway and enrichment analysis. Use when adding functional annotation to predicted genes or characterizing protein functions in a new genome.

gptomics
npx skills add GPTomics/bioSkills

SKILL.md

name: bio-genome-annotation-functional-annotation
description: Assign GO terms, KEGG orthologs, Pfam domains, and EC numbers to predicted proteins using eggNOG-mapper and InterProScan. Produces functional summaries for downstream pathway and enrichment analysis. Use when adding functional annotation to predicted genes or characterizing protein functions in a new genome.
tool_type: cli
primary_tool: eggNOG-mapper

Version Compatibility

Reference examples tested with: pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures
  • CLI: <tool> --version then <tool> --help to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
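
The checks above can be scripted before a run. The following is a minimal sketch, assuming the tools are installed under the names used in this document (emapper.py, interproscan.sh); the function name check_environment is hypothetical. It only reports the installed package version and whether each tool is on PATH, so flag checks with <tool> --help are still needed.

```python
import shutil
from importlib import metadata


def check_environment(packages=('pandas',), tools=('emapper.py', 'interproscan.sh')):
    '''Map each package to its installed version and each tool to its path (None if missing).'''
    report = {}
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = None  # package is not installed in this environment
    for tool in tools:
        report[tool] = shutil.which(tool)  # None when the tool is not on PATH
    return report


if __name__ == '__main__':
    for name, found in check_environment().items():
        print(f'{name}: {found or "NOT FOUND"}')
```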

Functional Annotation

"Functionally annotate my predicted proteins" → Assign GO terms, KEGG orthologs, Pfam domains, and EC numbers to predicted protein sequences using orthology-based and domain-scan methods.

  • CLI: emapper.py -i proteins.fa --output annotations (eggNOG-mapper), interproscan.sh -i proteins.fa (InterProScan)

Assign functional annotations (GO terms, KEGG orthologs, Pfam domains, EC numbers) to predicted protein sequences using eggNOG-mapper and InterProScan.

eggNOG-mapper

Database Setup

# Download eggNOG v5.0 database (~44 GB)
# Required for local searches; use --data_dir to specify location
download_eggnog_data.py --data_dir /path/to/eggnog_db -y

# Skip the DIAMOND database (~9 GB) with -D for a smaller download
# -D means "do not install DIAMOND"; omit it if you plan to search with -m diamond
download_eggnog_data.py --data_dir /path/to/eggnog_db -y -D

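# Download taxon-specific HMMER databases (optional; only needed for -m hmmer)
# -H enables the HMMER download and -d gives the NCBI taxon ID; confirm with --help
download_eggnog_data.py --data_dir /path/to/eggnog_db -y -H -d 2 # Bacteria
download_eggnog_data.py --data_dir /path/to/eggnog_db -y -H -d 2759 # Eukaryota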

Basic Usage

emapper.py \
    -i predicted_proteins.faa \
    --output functional_annot \
    --output_dir eggnog_out \
    --data_dir /path/to/eggnog_db \
    --cpu 16 \
    -m diamond

Key Options

Option                    Description
-i                        Input protein FASTA
--output                  Output file prefix
--data_dir                Path to eggNOG database
-m                        Search mode: diamond (fast), mmseqs (sensitive), hmmer
--cpu                     CPU threads
--tax_scope               Taxonomic scope (auto, Bacteria, Eukaryota, etc.)
--go_evidence             GO evidence filter (experimental, non-electronic, all)
--target_orthologs        Ortholog type (one2one, all)
--seed_ortholog_evalue    E-value cutoff (default: 0.001)
--seed_ortholog_score     Min bit score (default: 60)
--override                Overwrite existing output

With Taxonomic Scope

# Restrict to bacterial orthologs for a prokaryotic genome
emapper.py \
    -i proteins.faa \
    --output annot \
    --output_dir eggnog_out \
    --data_dir /path/to/eggnog_db \
    --cpu 16 \
    -m diamond \
    --tax_scope Bacteria \
    --go_evidence non-electronic

Output Files

eggnog_out/
├── annot.emapper.annotations    # Main annotation table
├── annot.emapper.hits           # DIAMOND/mmseqs hits
├── annot.emapper.seed_orthologs # Best orthologs
└── annot.emapper.pfam           # Pfam domain annotations (written only with --pfam_realign)

Key Output Columns

Column           Content
seed_ortholog    Best matching ortholog
evalue           E-value of best hit
GOs              GO term annotations
EC               Enzyme Commission numbers
KEGG_ko          KEGG ortholog IDs
KEGG_Pathway     KEGG pathway mappings
COG_category     COG functional category
PFAMs            Pfam domain annotations
Description      Functional description
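
For downstream pathway mapping, the KEGG_ko column usually needs to be expanded into one protein/KO pair per row. The following is a minimal sketch, assuming the file has a '#query' header line, '##' metadata lines, comma-separated KO values prefixed with 'ko:', and '-' for missing values. The function names read_emapper_rows and ko_pairs are hypothetical.

```python
def read_emapper_rows(lines):
    '''Yield one dict per annotated query from the lines of an .emapper.annotations file.'''
    header = None
    for line in lines:
        line = line.rstrip('\r\n')
        if not line or line.startswith('##'):
            continue  # run metadata and trailing summary lines
        if line.startswith('#'):
            header = line.lstrip('#').split('\t')  # the '#query ...' column header
            continue
        if header is None:
            raise ValueError('no "#query" header line found before data rows')
        yield dict(zip(header, line.split('\t')))


def ko_pairs(rows):
    '''Return sorted, de-duplicated (query, KO) pairs such as ('g1', 'K00001').'''
    pairs = set()
    for row in rows:
        value = row.get('KEGG_ko', '-')
        if value in ('', '-'):
            continue  # no KEGG ortholog assigned
        for ko in value.split(','):
            ko = ko.strip()
            if ko.startswith('ko:'):
                ko = ko[3:]  # drop the 'ko:' prefix
            if ko:
                pairs.add((row['query'], ko))
    return sorted(pairs)
```

An open file handle can be passed directly, for example ko_pairs(read_emapper_rows(open('eggnog_out/annot.emapper.annotations'))).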

InterProScan

InterProScan searches multiple protein signature databases simultaneously.

Basic Usage

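# -b sets the output base name (interpro_results.tsv, interpro_results.gff3);
# -o takes an explicit file name and works with a single output format only
interproscan.sh \
    -i predicted_proteins.faa \
    -b interpro_results \
    -f tsv,gff3 \
    -cpu 16 \
    -goterms \
    -pa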

Key Options

Option      Description
-i          Input protein FASTA
-o          Output file (single output format only)
-b          Output file base name (extension added per format)
-f          Output formats: tsv, gff3, xml, json
-cpu        CPU threads
-goterms    Include GO term mappings
-pa         Include pathway annotations
-appl       Specific applications to run (comma-separated)
-dp         Disable precalculated match lookup

Select Specific Databases

# Run only Pfam, TIGRFAM, and CDD
# -b is used instead of -o because two output formats are requested
interproscan.sh \
    -i proteins.faa \
    -b interpro_results \
    -f tsv,gff3 \
    -cpu 16 \
    -goterms -pa \
    -appl Pfam,TIGRFAM,CDD

Available Applications

Application    Description
Pfam           Protein families
TIGRFAM        Functionally equivalent protein families
SUPERFAMILY    Structural domain assignments
CDD            Conserved Domain Database
PANTHER        Protein classification
Gene3D         Structural domain predictions
Coils          Coiled-coil predictions
MobiDBLite     Disordered regions
SignalP        Signal peptides
TMHMM          Transmembrane helices
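
Results from a single application can be pulled out of the InterProScan TSV by filtering on the analysis column. The following is a minimal sketch for Pfam, assuming the TSV column order used elsewhere in this document (protein ID, MD5, length, analysis, signature accession, signature description, start, stop, ...). The function name pfam_domains is hypothetical.

```python
from collections import defaultdict


def pfam_domains(tsv_lines):
    '''Map protein ID -> list of (pfam_acc, start, stop), sorted by start position.'''
    domains = defaultdict(list)
    for line in tsv_lines:
        fields = line.rstrip('\r\n').split('\t')
        if len(fields) < 8 or fields[3] != 'Pfam':
            continue  # skip blank or short lines and hits from other analyses
        protein_id, acc = fields[0], fields[4]
        start, stop = int(fields[6]), int(fields[7])
        domains[protein_id].append((acc, start, stop))
    return {pid: sorted(hits, key=lambda h: h[1]) for pid, hits in domains.items()}
```

Keeping the coordinates makes it possible to reconstruct domain order along each protein, which a comma-joined accession list loses.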

Merging eggNOG and InterProScan Results

Goal: Combine functional annotations from eggNOG-mapper and InterProScan into a single per-protein table with unified GO terms.

Approach: Parse the eggNOG annotation table and InterProScan TSV output separately, aggregate InterProScan hits per protein, merge on protein ID, and deduplicate GO terms from both sources.

import re

import pandas as pd

# Matches a bare GO identifier such as GO:0005524
GO_ID = re.compile(r'GO:\d{7}')

def parse_eggnog(annotations_file):
    '''Parse eggNOG-mapper annotations output ('-' marks a missing value).'''
    # comment='#' already skips the '##' metadata lines and the '#query'
    # header line, so no fixed skiprows count is needed
    df = pd.read_csv(annotations_file, sep='\t', comment='#', header=None)
    col_names = [
        'query', 'seed_ortholog', 'evalue', 'score', 'eggNOG_OGs',
        'max_annot_lvl', 'COG_category', 'Description', 'Preferred_name',
        'GOs', 'EC', 'KEGG_ko', 'KEGG_Pathway', 'KEGG_Module',
        'KEGG_Reaction', 'KEGG_rclass', 'BRITE', 'KEGG_TC', 'CAZy',
        'BiGG_Reaction', 'PFAMs'
    ]
    df.columns = col_names[:len(df.columns)]
    return df

def parse_interproscan_tsv(tsv_file):
    '''Parse InterProScan TSV output.'''
    col_names = [
        'protein_id', 'md5', 'length', 'analysis', 'signature_acc',
        'signature_desc', 'start', 'stop', 'score', 'status', 'date',
        'interpro_acc', 'interpro_desc', 'go_terms', 'pathways'
    ]
    # '-' marks an empty field; reading it as NaN keeps it out of the joined lists
    df = pd.read_csv(tsv_file, sep='\t', header=None, names=col_names,
                     na_values='-')
    return df

def merge_annotations(eggnog_file, interpro_file):
    '''Merge eggNOG and InterProScan annotations per protein.'''
    eggnog_df = parse_eggnog(eggnog_file)
    interpro_df = parse_interproscan_tsv(interpro_file)

    # Collapse the one-row-per-hit InterProScan table to one row per protein
    interpro_summary = interpro_df.groupby('protein_id').agg({
        'signature_acc': lambda x: ','.join(x.dropna().astype(str).unique()),
        'interpro_acc': lambda x: ','.join(x.dropna().astype(str).unique()),
        'go_terms': lambda x: '|'.join(x.dropna().astype(str).unique()),
    }).reset_index()
    interpro_summary.columns = ['query', 'interpro_signatures', 'interpro_ids', 'interpro_go']

    # Outer merge keeps proteins annotated by only one of the two tools
    merged = eggnog_df.merge(interpro_summary, on='query', how='outer')

    merged['all_go'] = merged.apply(
        lambda row: combine_go_terms(row.get('GOs', ''), row.get('interpro_go', '')), axis=1
    )
    return merged

def combine_go_terms(eggnog_go, interpro_go):
    '''Combine GO terms from both sources, removing duplicates.'''
    terms = set()
    for go_str in [eggnog_go, interpro_go]:
        if pd.notna(go_str) and go_str != '-':
            # Regex extraction handles both ',' and '|' separators and drops any
            # source suffix, e.g. 'GO:0005524(InterPro)' in some InterProScan releases
            terms.update(GO_ID.findall(str(go_str)))
    return ','.join(sorted(terms)) if terms else '-'
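
Enrichment tools generally expect one protein/GO pair per line rather than a comma-joined string. The following is a minimal sketch that expands the all_go values into that long format, assuming comma-separated GO IDs with '-' or NaN for unannotated proteins. The function names go_long_format and write_gene2go are hypothetical, and the two-column tab-separated layout is an assumption to adapt to the target tool.

```python
def go_long_format(records):
    '''Expand (protein_id, 'GO:..,GO:..') records into sorted (protein_id, go_id) pairs.'''
    pairs = set()
    for protein_id, go_string in records:
        if not isinstance(go_string, str) or go_string == '-':
            continue  # unannotated protein (NaN or '-')
        for term in go_string.split(','):
            term = term.strip()
            if term.startswith('GO:'):
                pairs.add((protein_id, term))
    return sorted(pairs)


def write_gene2go(pairs, handle):
    '''Write tab-separated protein/GO pairs, one per line, to an open text handle.'''
    for protein_id, go_id in pairs:
        handle.write(f'{protein_id}\t{go_id}\n')
```

With the merged table from merge_annotations, the records can be supplied as merged[['query', 'all_go']].itertuples(index=False, name=None).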

Annotation Statistics

def annotation_summary(merged_df):
    '''Summarize functional annotation coverage.'''
    total = len(merged_df)
    if total == 0:
        print('Total proteins: 0')
        return

    def annotated(col):
        '''Mask of rows where col holds a real value (not NaN, '-', or empty).'''
        if col not in merged_df:
            return pd.Series(False, index=merged_df.index)
        values = merged_df[col]
        # eggNOG-mapper writes '-' for missing values, so notna() alone overcounts
        return values.notna() & ~values.isin(['-', ''])

    has_go = annotated('all_go').sum()
    has_kegg = annotated('KEGG_ko').sum()
    has_pfam = annotated('PFAMs').sum()
    has_ec = annotated('EC').sum()
    has_desc = annotated('Description').sum()

    print(f'Total proteins: {total}')
    print(f'With GO terms: {has_go} ({has_go/total:.1%})')
    print(f'With KEGG orthologs: {has_kegg} ({has_kegg/total:.1%})')
    print(f'With Pfam domains: {has_pfam} ({has_pfam/total:.1%})')
    print(f'With EC numbers: {has_ec} ({has_ec/total:.1%})')
    print(f'With description: {has_desc} ({has_desc/total:.1%})')

    # Annotation coverage target: >60% with at least one functional term
    has_any = (annotated('all_go') | annotated('PFAMs') | annotated('KEGG_ko')).sum()
    print(f'With any annotation: {has_any} ({has_any/total:.1%})')
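
A per-category tally of the COG_category column is another common functional summary. The following is a minimal sketch, assuming each value is a string of one or more single-letter COG codes (for example 'KL'), with '-' or NaN for proteins without a category. The function name cog_category_counts and the 'unassigned' label are hypothetical.

```python
from collections import Counter


def cog_category_counts(categories):
    '''Count proteins per COG letter; a multi-letter value counts once for each letter.'''
    counts = Counter()
    for value in categories:
        if not isinstance(value, str) or value.strip() in ('', '-'):
            counts['unassigned'] += 1  # NaN, empty, or '-' placeholder
            continue
        counts.update(set(value.strip()))  # set() avoids double-counting repeated letters
    return counts
```

It accepts any iterable of values, so cog_category_counts(merged_df['COG_category']) works on the merged table. Because multi-letter proteins are counted under each letter, the totals can exceed the number of proteins.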

Troubleshooting

Low Annotation Rate

  • Check protein sequence quality (no fragmented ORFs)
  • Try broader taxonomic scope (--tax_scope auto)
  • Run both eggNOG-mapper and InterProScan and merge results
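
The sequence quality check in the first bullet can be approximated with a quick scan of the protein FASTA. The following is a minimal sketch that flags proteins that are short, lack a starting methionine, or contain internal stop characters. The function names and the min_length=50 threshold are assumptions for illustration, not established cutoffs.

```python
def read_fasta(lines):
    '''Yield (name, sequence) tuples from an iterable of FASTA lines.'''
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = (line[1:].split() or [''])[0], []  # ID up to first whitespace
        else:
            chunks.append(line)
    if name is not None:
        yield name, ''.join(chunks)


def protein_qc(records, min_length=50):
    '''Flag likely fragments: short, missing start Met, or internal stop characters.'''
    report = {'total': 0, 'short': [], 'no_start_met': [], 'internal_stop': []}
    for name, seq in records:
        report['total'] += 1
        body = seq.rstrip('*')  # a trailing '*' is only the terminal stop
        if len(body) < min_length:  # min_length is an illustrative threshold
            report['short'].append(name)
        if not body.upper().startswith('M'):
            report['no_start_met'].append(name)
        if '*' in body:
            report['internal_stop'].append(name)
    return report
```

A typical call is protein_qc(read_fasta(open('predicted_proteins.faa'))). A high fraction of flagged proteins suggests the gene models, rather than the annotation databases, are limiting the annotation rate.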

eggNOG Database Errors

  • Verify database version matches emapper version
  • Re-download with download_eggnog_data.py --data_dir /path -y

InterProScan Memory Issues

  • Lower -cpu so fewer analyses run concurrently (note that -b sets the output file base name, not a batch size)
  • Split input FASTA into smaller chunks
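
Splitting the input can be sketched as follows, assuming a standard FASTA file where each record starts with a '>' line. The function names, the records_per_chunk default of 1000, and the prefix_0001.faa naming scheme are hypothetical choices.

```python
def split_fasta(lines, records_per_chunk=1000):
    '''Yield lists of FASTA lines, each holding at most records_per_chunk records.'''
    if records_per_chunk < 1:
        raise ValueError('records_per_chunk must be >= 1')
    chunk, n_records = [], 0
    for line in lines:
        if line.startswith('>'):
            if n_records == records_per_chunk:
                yield chunk  # current chunk is full; start a new one
                chunk, n_records = [], 0
            n_records += 1
        if line.strip():
            chunk.append(line if line.endswith('\n') else line + '\n')
    if chunk:
        yield chunk


def write_chunks(fasta_path, prefix, records_per_chunk=1000):
    '''Write prefix_0001.faa, prefix_0002.faa, ... and return the list of paths.'''
    paths = []
    with open(fasta_path) as handle:
        for i, chunk in enumerate(split_fasta(handle, records_per_chunk), start=1):
            path = f'{prefix}_{i:04d}.faa'
            with open(path, 'w') as out:
                out.writelines(chunk)
            paths.append(path)
    return paths
```

Each chunk can then be run through interproscan.sh separately and the resulting TSV files concatenated, since the TSV output has no header line.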

Related Skills

  • prokaryotic-annotation - Bakta includes basic functional annotation
  • eukaryotic-gene-prediction - Produces protein sequences for functional annotation
  • pathway-analysis/go-enrichment - Enrichment analysis using GO annotations
  • pathway-analysis/kegg-pathways - Pathway mapping with KEGG orthologs
Skills Info
Original Name: bio-genome-annotation-functional-annotation
Author: gptomics