analyst
Support academic research data analysis using dbt and Streamlit. Use this skill when working with computational research projects (identified by dbt/ directory, Streamlit apps, or empirical data pipelines). The skill enforces academicOps best practices for reproducible, transparent, self-documenting research with collaborative single-step workflow.
SKILL.md
| Name | analyst |
| Description | Support academic research data analysis using dbt and Streamlit. Use this skill when working with computational research projects (identified by dbt/ directory, Streamlit apps, or empirical data pipelines). The skill enforces academicOps best practices for reproducible, transparent, self-documenting research with collaborative single-step workflow. |
name: analyst description: Support academic research data analysis using dbt and Streamlit. Use this skill when working with computational research projects (identified by dbt/ directory, Streamlit apps, or empirical data pipelines). The skill enforces academicOps best practices for reproducible, transparent, self-documenting research with collaborative single-step workflow. category: instruction allowed-tools: Read,Grep,Glob,Edit,Write,Bash,Skill version: 2.0.0 permalink: skills-analyst-skill triggers:
- "data analysis"
- "dbt project"
- "streamlit app"
- "research pipeline"
Analyst
Taxonomy note: This skill provides domain expertise (HOW) for research data analysis using dbt and Streamlit. See [[TAXONOMY.md]] for the skill/workflow distinction.
Overview
Support academic research data analysis by working collaboratively with dbt (data build tool) and Streamlit dashboards. This skill enforces academicOps methodology: reproducible data pipelines, automated testing, self-documenting code, and fail-fast validation.
Core principle: Take ONE action at a time (generate a chart, update database, create a test), then yield to the user for feedback before proceeding.
šØ CRITICAL: Research Data is Immutable
Source datasets, ground truth labels, experimental records, and research configurations are SACRED. NEVER modify, reformat, or "fix" them. If infrastructure doesn't support a format: HALT and report. Violations are scholarly misconduct.
šØ CRITICAL: Transformation Boundary Rule
ALL data transformation happens in dbt. Period.
This is non-negotiable for academic integrity, reproducibility, and auditability.
| Layer | Allowed | Prohibited |
|---|---|---|
| dbt | ALL SQL transformations, joins, aggregations, filtering, business logic | - |
| Streamlit | Display, formatting, interactive filtering of PRE-COMPUTED data | SQL that transforms, joins, aggregates, or applies business logic |
Why This Matters (Academic Integrity)
- Reproducibility: Anyone can re-run
dbt buildand get identical results - Auditability: Transformation logic is version-controlled and testable
- Transparency: Reviewers see exactly how data was processed
- Testing: dbt tests PROVE transformations work correctly
The Rule in Practice
Need a new metric? ā Create a dbt mart with tests Need to filter data? ā Pre-compute filtered views in dbt OR use Streamlit widgets on EXISTING columns (no new calculations) Need to join tables? ā Create a dbt model that joins them Need aggregations? ā Create a dbt mart with the aggregations
Streamlit: Display Layer ONLY
Streamlit scripts may:
- ā
SELECT * FROM mart_name(read pre-computed data) - ā
WHERE column = :user_selection(filter on existing columns) - ā Format numbers, dates for display
- ā Create interactive widgets that filter existing data
- ā Render charts from pre-computed metrics
Streamlit scripts must NEVER:
- ā
SELECT SUM(...) GROUP BY ...(aggregation = transformation) - ā
SELECT a.*, b.* FROM a JOIN b(joins = transformation) - ā
SELECT CASE WHEN ... END(business logic = transformation) - ā Calculate derived metrics inline
- ā Apply any formula that changes the meaning of data
If You're Tempted to Transform in Streamlit
STOP. Create a dbt mart instead:
- Create
marts/mart_name.sqlwith the transformation - Add tests in
schema.ymlproving it works - Run
dbt build --select mart_name - THEN query the mart from Streamlit
This takes more time. That's the point. Transformations deserve scrutiny.
Documentation Index
Instructions (_CHUNKS/)
- Investigation: [[instructions/data-investigation.md]], [[instructions/exploratory-analysis.md]]
- Research docs: [[instructions/research-documentation.md]] (REQUIRED), [[instructions/methodology-files.md]], [[instructions/methods-vs-methodology.md]], [[instructions/experiment-logging.md]]
- Technical: [[instructions/dbt-workflow.md]], [[instructions/streamlit-workflow.md]]
References
[[references/dbt-workflow.md]], [[references/streamlit-patterns.md]], [[references/context-discovery.md]]
Statistical Analysis (references/)
Start with [[references/statistical-analysis.md]] (complete guide). Also: [[references/test_selection_guide.md]], [[references/assumptions_and_diagnostics.md]], [[references/effect_sizes_and_power.md]], [[references/bayesian_statistics.md]], [[references/reporting_standards.md]].
Python Libraries
Core libraries: [[references/matplotlib.md]], [[references/seaborn.md]], [[references/statsmodels.md]], [[references/streamlit.md]]. Use python-dev skill for code standards.
When to Use This Skill
Invoke this skill when:
- Working in computational research projects - Directory contains
dbt/, Streamlit apps, or empirical data pipelines - User requests data analysis - "Analyze X", "Create a chart showing Y", "Explore the relationship between Z"
- Building or updating dashboards - Streamlit visualization work
- Creating or modifying dbt models - Data transformation pipelines
- Validating data quality - Adding tests, checking consistency
Key indicators in project structure:
dbt/models/directory (staging, intermediate, marts)streamlit/or.pyfiles with Streamlit codedata/warehouse.dbor similar analytical database- Academic research focus (papers, empirical analysis)
Workflow Decision Tree
START
ā
āā Is this a new analysis task?
ā āā YES ā Go to: Context Discovery
ā āā NO ā Is context already loaded?
ā āā YES ā Go to: Task Execution
ā āā NO ā Go to: Context Discovery
ā
Context Discovery (REQUIRED FIRST STEP)
ā
āā Read project context files:
ā āā README.md (current directory + all parents to project root)
ā āā data/README.md (if exists)
ā āā data/projects/[project-name].md (if exists)
ā
āā Identify project conventions:
ā āā Research questions
ā āā Data sources and access patterns
ā āā Existing dbt models (list them)
ā āā Testing strategy
ā āā Project-specific rules
ā
āā Proceed to: Task Execution
ā
Task Execution
ā
āā What type of task?
ā āā Data access ā Go to: Data Access Workflow
ā āā Visualization ā Go to: Visualization Workflow
ā āā dbt model ā Go to: DBT Model Workflow
ā āā Testing ā Go to: Testing Workflow
ā āā Exploration ā Go to: Exploratory Analysis
ā
āā After completing ONE step:
āā Report results to user
āā Explain what was done
āā STOP and wait for user feedback
Context Discovery
CRITICAL FIRST STEP: Before any analysis work, automatically discover and read project context.
Required Context Files
-
Project README files
- Current working directory
README.md - All parent directories up to project root (e.g.,
papers/automod/,projects/buttermilk/) - Purpose: Understand research questions, conventions, project structure
- Current working directory
-
Data README
data/README.mdin the project- Purpose: Understand data sources, schema, access patterns
-
Project overview
data/projects/[project-name].mdcorresponding to current project- Purpose: Strategic context, goals, status
Context Extraction
From these files, identify:
- Research questions - What is this project investigating?
- Data sources - Where does data come from? (BigQuery, APIs, files?)
- Existing dbt models - What models already exist? (Run
ls -1 dbt/models/**/*.sql) - Conventions - Naming patterns, coding standards, project-specific rules
- Testing strategy - What tests exist? What quality expectations?
- Tools and technologies - DuckDB? PostgreSQL? Specific Python packages?
Example context discovery:
# List existing dbt models
ls -1 dbt/models/staging/*.sql dbt/models/marts/*.sql
# Check for Streamlit apps
ls -1 streamlit/*.py
# Understand project structure
cat README.md
cat data/README.md
After context discovery, summarize findings to user:
"I've reviewed the project context. This is a <research topic> project investigating <questions>. The DBT pipeline has <N> staging models and <M> mart models. I see existing work on <areas>. What would you like me to help with?"
Follow Data Access Workflow
šØ CRITICAL RULE: ALL data access MUST go through dbt models. NEVER query upstream sources directly.
šØ REMINDER: If you need to transform data, that transformation MUST be a dbt model with tests. See "Transformation Boundary Rule" above.
Decision Tree
Need data for analysis?
ā
āā Does required data exist in dbt marts?
ā āā YES ā Use `SELECT * FROM {{ ref('mart_name') }}`
ā ā āā Done! Use this data in analysis.
ā ā
ā āā NO ā Does it exist in staging models?
ā āā YES ā Should this become a new mart?
ā ā āā YES ā Go to: DBT Model Workflow (create mart)
ā ā āā NO ā Use staging model for exploratory work
ā ā
ā āā NO ā Data doesn't exist in dbt yet
ā āā Ask user: "Should I create a dbt model for [data source]?"
ā āā YES ā Go to: DBT Model Workflow (create staging model)
ā āā NO ā Stop. Cannot proceed without dbt model.
Prohibited Actions
ā NEVER do this:
# Direct BigQuery query - PROHIBITED
df = client.query("SELECT * FROM bigquery.raw.cases").to_dataframe()
# Direct database query - PROHIBITED
df = pd.read_sql("SELECT * FROM raw_schema.table", engine)
# Direct API call for analysis data - PROHIBITED
response = requests.get("https://api.example.com/data")
ā ALWAYS do this:
# Query through dbt mart - CORRECT
import duckdb
conn = duckdb.connect("data/warehouse.db")
df = conn.execute("SELECT * FROM fct_case_decisions").df()
# Or reference in Streamlit
@st.cache_data
def load_data():
conn = duckdb.connect("data/warehouse.db")
return conn.execute("SELECT * FROM fct_case_decisions").df()
Why This Matters
- Reproducibility: Queries are version-controlled in dbt
- Data governance: dbt models are single source of truth
- Quality: Data passes through validated transformation pipeline
- Consistency: All analysts use same transformations
See: [[references/dbt-workflow.md]] for detailed dbt patterns
Follow DBT Model Workflow
Create or modify dbt models following academicOps layered architecture.
For detailed dbt workflow including model layers, single-step workflow, and examples, see [[instructions/dbt-workflow.md]]
Quick Reference: Model Layers
- Staging (
stg_*) - Clean and standardize raw data (no business logic) - Intermediate (
int_*) - Business logic transformations (can be ephemeral) - Marts (
fct_*,dim_*) - Analysis-ready datasets (materialized)
Quick Reference: Workflow Pattern
- Create model file ā STOP, show user
- Add documentation ā STOP, show user
- Add tests ā STOP, show user
- Run model and tests ā STOP, report results
ALWAYS check for duplicate models before creating new ones.
See: [[instructions/dbt-workflow.md]] for complete workflow details and [[references/dbt-workflow.md]] for comprehensive patterns
Follow Visualization Workflow
Create Streamlit visualizations following single-step collaborative pattern.
šØ REMINDER: Streamlit is DISPLAY ONLY. No transformations. See "Transformation Boundary Rule" above.
For detailed Streamlit workflow including structure, single-step patterns, and examples, see `@reference _CHUNKS/streamlit-workflow.md]]
Quick Reference: Streamlit Pattern
Load data ā STOP ā Create chart ā STOP ā Add interactivity ā STOP. One change at a time. Hot Reloads: Don't restart Streamlit; it auto-reloads. See [[instructions/streamlit-workflow.md]].
Follow Testing Workflow
Add tests to validate data quality at every pipeline stage.
Testing Strategy
Use appropriate test type for the validation:
| Test Type | Use For | Example |
|---|---|---|
| Schema tests | Column-level checks | not_null, unique, accepted_values |
| Singular tests | Multi-column logic | Date range validation, cross-table consistency |
| Package tests | Common patterns | Recency checks, multi-column uniqueness |
| Diagnostic models | Quality monitoring | Aggregated metrics for manual review |
Follow Single-Step Testing Workflow
Step 1: Identify what to test
Review the model and ask:
- Which columns should never be null?
- Which columns should be unique?
- Are there accepted value lists?
- Any date range logic to validate?
STOP. Discuss with user which tests to add.
Step 2: Add schema tests (after user agrees on test plan)
# dbt/schema.yml
models:
- name: stg_cases
columns:
- name: case_id
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ["pending", "reviewed", "published"]
STOP. Show to user.
Step 3: Run tests (after user approves test definitions)
dbt test --select stg_cases
STOP. Report results. If failures, discuss with user before fixing.
Step 4: Add singular test if needed (complex validation)
-- tests/assert_decision_dates_logical.sql
select
case_id,
submission_date,
decision_date
from {{ ref('stg_cases') }}
where decision_date < submission_date
STOP. Show test SQL to user.
Step 5: Run singular test
dbt test --select test_name:assert_decision_dates_logical
STOP. Report results.
Test Severity
Use severity: warn for known issues or aspirational standards:
tests:
- not_null:
severity: warn # Don't fail build, just warn
Pipeline/Template Validation Tests
When testing LLM pipelines or templated content, validate substantive content not just error patterns:
- ā Check content length minimums (e.g., criteria block > 100 chars)
- ā Verify required sections exist AND have content
- ā
Use position-based length for multiline content (regex
.*?doesn't cross newlines) - ā Don't just check for specific error strings - upstream bugs are unpredictable
See: [[references/dbt-workflow.md]] for complete testing patterns
Follow Data Investigation Workflow
When investigating data quality issues (missing values, unexpected patterns, join coverage), create REUSABLE investigation scripts in analyses/ directory. Never use throwaway one-liners for data investigation.
For complete workflow, script templates, and when to create investigation scripts, see `@reference _CHUNKS/data-investigation.md]]
Exploratory Analysis
When exploring data patterns and relationships, follow collaborative discovery process. Take one analytical step at a time, yielding to user after each finding.
For complete exploration workflow and anti-patterns, see `@reference _CHUNKS/exploratory-analysis.md]]
NOTE: For data quality issues (missing values, unexpected nulls), use Data Investigation Workflow instead.
Documentation Philosophy
Self-documenting work: Do NOT create separate analysis reports or random documentation files.
šØ CRITICAL: Research projects must follow STRICT documentation structure. See `@reference _CHUNKS/research-documentation.md]] for complete requirements.
Required Documentation Structure
Research projects MUST maintain:
- README.md - Project overview and quick start
- METHODOLOGY.md - Research design and approach (see `@reference _CHUNKS/methodology-files.md]])
- methods/*.md - Technical implementation details (see `@reference _CHUNKS/methods-vs-methodology.md]])
- data/README.md - Data sources and schema
- dbt/schema.yml - Model and column documentation
- experiments/YYYYMMDD-description/ - Experimental work (see `@reference _CHUNKS/experiment-logging.md]])
Where Analysis Documentation Lives
- Streamlit dashboards - Interactive exploration and validation
- Jupyter notebooks - Detailed analysis with inline markdown (in experiments/ if exploratory)
- GitHub issues - Track analysis tasks and decisions
- Code comments - Explain analytical decisions in dbt models
- Commit messages - Document why changes were made
- dbt schema.yml - Document model purposes and column meanings
- methods/*.md - Technical method specifications
Prohibited
ā Create analysis_report.md]] or any random markdown files ā Createfindings_summary.docx` ā Proliferate documentation files without defined structure ā Leave documentation stale when code changes
ā Follow strict structure defined in [[instructions/research-documentation.md]] ā Update documentation in SAME commit as code changes ā One source of truth for each piece of information
Collaborative Workflow Principles
One step at a time:
- Perform ONE action (create chart, write model, run test)
- Show results to user
- Explain what was done and what it means
- STOP and wait for user feedback
- Proceed based on user direction
Never:
- Create multiple artifacts without checkpoints
- Make assumptions about next steps
- Implement complex workflows end-to-end without user input
Always:
- Explain options and ask for user preference
- Show intermediate results
- Yield control back to user frequently
Quick Reference
See [[references/quick-reference-commands.md]] for common dbt, Streamlit, and DuckDB commands.