Week 9: Neuroinformatics -- Standards, Sharing, and Credit

Overview

A finished analysis is not a finished contribution. The data behind it has to be reproducible, shareable, and citable, or the work dies with the paper. Two standards carry the weight: the Brain Imaging Data Structure (BIDS) answers where everything lives (structure), and Hierarchical Event Descriptors (HED) answer what every event meant (semantics). The single most useful idea this week: the bar for a complete annotation is concrete and falsifiable. A language model should be able to reconstruct the stimulus, or the experiment, from the annotation alone. That is not a metaphor; it is exactly the test demonstrated in the Healthy Brain Network EEG (HBN-EEG) paper (Shirazi et al., 2024, Figure 9), where Claude Sonnet 3.5 regenerated the Surround Suppression stimulus from its HED description with no image.

The dataset throughout is HBN-EEG, the very data the course has analyzed since Week 3. It is itself a BIDS + HED dataset, published on both OpenNeuro and NEMAR with exactly the tools this session teaches. The loop closes: the data you analyzed is the worked example for how to share data.

Learning Objectives

Frame data sharing as three locks: structure, semantics, and credit
Read a BIDS dataset: directory layout, JSON sidecars, events.tsv, participants.tsv
Understand HED as the semantics layer: hierarchical, composable, validatable tags in an events.json sidecar
Apply the recreate-the-stimulus bar: an annotation is complete when a language model can rebuild the stimulus from it alone
Use HEDit to turn a rich prose description into validated HED (Parser to Tagger to Validator)
Use /neuroinformatics:bids-conversion and the bids-validator agent (the mechanical defence) to produce a valid dataset
Share with OpenNeuro and NEMAR; use nemar-cli for trivial validation, a private-repo collaboration workflow, and DOIs with ORCID auto-linking
Read a real DataCite metadata gap: the same dataset on NEMAR vs OpenNeuro

Guide

The rest of this page walks through the workflow in the order the live session follows.

1. The Reuse-and-Credit Gap

A lab collects dense, synchronized data. What reaches re-users is a thin events.tsv with one cryptic column and a results figure locked in a folder. Three locks snap shut at once.

A rich raw recording funnels down to a thin shared artifact, with three padlocks: structure, semantics, and credit

Structure -- where is everything? Custom layouts mean every re-user writes glue code first.
Semantics -- what did the events mean? A numeric event code is meaningless outside the lab.
Credit -- who is cited when it is reused? Without a DOI and author identifiers, reuse traces back to no one.

Analysis-ready means no forensic search for unreported details

The information is not lost; it just never reaches the shared artifact. BIDS, HED, and good sharing close the three locks.

2. Two Standards, One Bar

BIDS answers where; HED answers what. The bar that judges both is the same: someone, or a language model, can reconstruct your experiment without emailing you.

BIDS (Brain Imaging Data Structure) -- a filesystem convention plus metadata. Structure.
HED (Hierarchical Event Descriptors) -- a controlled, composable, validatable event vocabulary. Semantics.

3. BIDS: One Layout, Every Dataset

A filesystem convention plus metadata: predictable names (sub-, ses-, task-), JSON sidecars, and TSV tables.

An annotated BIDS directory tree for one HBN-EEG subject

A BIDS dataset is readable by EEGLAB, MNE-Python, the BIDS validator, and BIDS Apps, and it is the upload format both OpenNeuro and NEMAR expect. Standard structure is also what makes mega-analysis across studies possible. The payoff is leverage, not bureaucracy.
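The predictable names are what make generic tooling possible. As a minimal sketch (not part of any BIDS library), the function below splits a BIDS-style filename into its key-value entities and suffix. The path and the function name `parse_bids_name` are hypothetical and only illustrate the pattern.

```python
from pathlib import PurePosixPath


def parse_bids_name(path: str) -> dict:
    """Split a BIDS-style filename into its key-value entities and suffix."""
    name = PurePosixPath(path).name
    stem = name.split(".", 1)[0]        # drop the extension(s)
    *pairs, suffix = stem.split("_")    # the last chunk is the suffix, e.g. "eeg"
    entities = {}
    for pair in pairs:
        key, _, value = pair.partition("-")
        entities[key] = value
    entities["suffix"] = suffix
    return entities


# Hypothetical path, used only to illustrate the naming pattern.
example = "sub-01/eeg/sub-01_task-surroundSupp_run-1_eeg.set"
info = parse_bids_name(example)
print(info)
```

Because every dataset follows the same pattern, the same few lines work on any subject, session, or task without lab-specific glue code.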

4. Where Structure Ends

The JSON sidecar carries acquisition metadata; events.tsv carries the timeline.

{
  "TaskName": "surroundSupp",
  "SamplingFrequency": 500,
  "EEGReference": "Cz",
  "PowerLineFrequency": 60,
  "EEGChannelCount": 128
}

onset    duration   value
0.000    n/a        12
1.500    n/a        14
3.000    n/a        12

Structure tells you where an event sits on the timeline. It cannot tell you what the event was. That gap is semantics.
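To make the point concrete, here is a small sketch that joins the two files above using only the standard library. It turns each onset into a sample index with the sidecar's SamplingFrequency, which is as far as structure alone can take you. The inline strings mirror the example files; in a real dataset you would read them from disk.

```python
import csv
import io
import json

# Inline copies of the two example files shown above.
sidecar = json.loads('{"TaskName": "surroundSupp", "SamplingFrequency": 500}')
events_tsv = (
    "onset\tduration\tvalue\n"
    "0.000\tn/a\t12\n"
    "1.500\tn/a\t14\n"
    "3.000\tn/a\t12\n"
)

rows = list(csv.DictReader(io.StringIO(events_tsv), delimiter="\t"))
fs = sidecar["SamplingFrequency"]

# Onset in seconds times sampling rate gives the sample index on the timeline.
samples = [round(float(row["onset"]) * fs) for row in rows]
codes = [row["value"] for row in rows]
print(list(zip(samples, codes)))  # [(0, '12'), (750, '14'), (1500, '12')]
```

The result places every event precisely, yet codes 12 and 14 remain opaque.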

5. HED: the Semantics Standard

events.tsv is thin: an onset and a cryptic numeric code. Stimulus, modality, condition, response, context -- all real, all recorded, none of it in the shared file. (HBN originally shipped numeric codes; the first curation step replaced them with meaningful strings, then annotated with HED.)

One HED tag is a slash-separated path through a controlled schema (commas separate the tags within an annotation); the hierarchy carries meaning, so analysis works at any level:

Action/Move/Move-body-part/Move-upper-extremity/Press

HED anatomy: a tag as a schema path, the inheritance tree, and the events.json sidecar pattern

You can analyze at the leaf (Press) or any ancestor (Move). The sidecar pattern keeps events.tsv unchanged; all semantics live in events.json under HED keys.
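A minimal sketch of both ideas, assuming the long form of each tag (a slash-separated schema path). The mapping from codes 12 and 14 to tags is invented for illustration and is not the HBN-EEG annotation. Real annotations usually use the short form (just Press) and rely on HED tools to expand it against the schema; this sketch also ignores parenthesized tag groups.

```python
# Hypothetical events.json sidecar: semantics live under the "HED" key of the
# column they describe, so events.tsv itself never changes.
events_json = {
    "value": {
        "HED": {
            "12": "Event/Sensory-event",
            "14": "Action/Move/Move-body-part/Move-upper-extremity/Press",
        }
    }
}


def codes_matching(sidecar: dict, column: str, query: str) -> list:
    """Return the event codes whose annotation contains `query` at any level."""
    matched = []
    for code, hed_string in sidecar[column]["HED"].items():
        # Commas separate tags; slashes separate levels within one tag.
        tags = [tag.strip() for tag in hed_string.split(",")]
        if any(query in tag.split("/") for tag in tags):
            matched.append(code)
    return matched


print(codes_matching(events_json, "value", "Press"))  # ['14']
print(codes_matching(events_json, "value", "Move"))   # ['14']
print(codes_matching(events_json, "value", "Event"))  # ['12']
```

Querying the ancestor Move finds the same events as querying the leaf Press, which is the practical meaning of "analysis works at any level".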

6. The Bar: Recreate the Stimulus

In the HBN-EEG paper, the HED annotation of the Surround Suppression task was handed to Claude Sonnet 3.5 with no image, and the model regenerated the visual stimulus from the annotation alone.

HBN-EEG Figure 9: the intended Surround Suppression stimulus and Claude's regeneration from the HED annotation alone, with what matched and the one miss

Everything structural came back correct: the gratings, the vertical-grating background, central fixation, the contrast relationship, four disks present. The only miss was the disks' size and position -- both awkward to express in HED, so they were left out, and the model had no way to reproduce them.

The completeness test

If a language model can rebuild your stimulus from the annotation alone, the annotation is complete. If it can't, you left something out. That honest miss is also a real lesson: HED nails event semantics, but spatial geometry is hard to encode.

Cite: Shirazi et al. (2024), HBN-EEG: The FAIR implementation of the Healthy Brain Network EEG dataset, bioRxiv 10.1101/2024.10.03.615261.

7. HEDit: AI-assisted HED

HED annotation stalls in most labs, and the cause is workflow, not willingness: roughly 2000 tags, expert-only fluency, and a validator with cryptic messages add up to a wall.

HEDit pipeline: Parser, Tagger, Validator with a feedback loop

HEDit turns the wall into a paragraph. You write one rich prose description per event value; a Parser to Tagger to Validator pipeline (LangGraph, with the official HED validator in the loop) returns a BIDS-compliant events.json with HED plus a provenance trail. The schema is the contract; no agent invents vocabulary. And HEDit is only as good as the description: it is tuned for exactly the detail the recreate-the-stimulus bar demands.
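The shape of that loop can be sketched in a few lines. This is a toy stand-in, not HEDit's code: the real pipeline uses language-model agents and the official HED validator, whereas here the Parser is a word splitter, the Tagger is a keyword table, and the Validator checks a four-tag vocabulary. All function names, the keyword table, and the deliberately invalid tag Button-push are assumptions made for the example.

```python
# Toy vocabulary standing in for the HED schema (an assumption for this sketch).
VOCABULARY = {"Sensory-event", "Visual-presentation", "Agent-action", "Press"}

# Toy tagger knowledge; "Button-push" is deliberately outside the vocabulary.
KEYWORD_TO_TAGS = {
    "grating": ["Sensory-event", "Visual-presentation"],
    "presses": ["Agent-action", "Press", "Button-push"],
}


def parse(description):
    """Parser stand-in: split the prose description into lowercase words."""
    return description.lower().replace(",", " ").split()


def tag(words, rejected):
    """Tagger stand-in: propose tags, skipping any the validator rejected."""
    tags = []
    for word in words:
        for candidate in KEYWORD_TO_TAGS.get(word, []):
            if candidate not in rejected and candidate not in tags:
                tags.append(candidate)
    return tags


def validate(tags):
    """Validator stand-in: return every tag that is not in the vocabulary."""
    return [t for t in tags if t not in VOCABULARY]


def annotate(description, max_rounds=3):
    """Tag, validate, and feed errors back until the annotation is clean."""
    words = parse(description)
    rejected, trail = set(), []
    for round_number in range(1, max_rounds + 1):
        tags = tag(words, rejected)
        errors = validate(tags)
        trail.append({"round": round_number, "tags": tags, "errors": errors})
        if not errors:
            return ", ".join(tags), trail
        rejected.update(errors)  # the feedback loop
    raise ValueError("no valid annotation within max_rounds")


hed_string, trail = annotate("A grating appears, the participant presses a button")
print(hed_string)
print(len(trail), "rounds")
```

The first round proposes an out-of-vocabulary tag, the validator rejects it, and the second round passes. The `trail` list plays the role of the provenance record.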

8. The neuroinformatics Plugin: 2 Skills + 1 Agent

/neuroinformatics:bids-conversion -- a guided conversion to BIDS.
bids-validator (agent) -- autonomous validation and fixes. This week's mechanical defence.
/neuroinformatics:experiment-design -- the data-collection side (PsychoPy + Lab Streaming Layer); in the plugin, not today's focus.

A guided six-step conversion ends where the next act begins, validation:

1. Inventory  ->  2. Scaffold  ->  3. Convert files  ->  4. JSON sidecars  ->  5. TSV tables  ->  6. Validate

Modalities: EEG, EMG, MEG, fMRI, and behavioral data.
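As a rough illustration of step 2 (Scaffold), the sketch below creates the minimal top-level files and per-subject folders by hand. It is not what the skill runs internally. The dataset name, subject labels, and the BIDSVersion value are assumptions; set the version to the specification release you actually target.

```python
import json
import tempfile
from pathlib import Path


def scaffold(root: Path, name: str, subjects: list) -> None:
    """Create the top-level BIDS files and one eeg/ folder per subject."""
    root.mkdir(parents=True, exist_ok=True)
    # Name and BIDSVersion are the required fields of dataset_description.json.
    description = {"Name": name, "BIDSVersion": "1.9.0"}  # version is an assumption
    (root / "dataset_description.json").write_text(json.dumps(description, indent=2))
    lines = ["participant_id"] + [f"sub-{label}" for label in subjects]
    (root / "participants.tsv").write_text("\n".join(lines) + "\n")
    for label in subjects:
        (root / f"sub-{label}" / "eeg").mkdir(parents=True, exist_ok=True)


# Build the scaffold in a throwaway directory and list what was created.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my-dataset"
    scaffold(root, "Example dataset", ["01", "02"])
    created = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

print(created)
```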

9. The bids-validator Agent: the Mechanical Defence

The agent runs the BIDS validator, categorizes findings, applies fixes with confirmation, re-validates, and reports readiness.

## BIDS Validation Report
Subjects: 12   Modalities: eeg
Errors fixed: 2
  [FIXED] missing dataset_description.json
  [FIXED] _eeg.json missing PowerLineFrequency -> 60
Remaining warnings: 2
Ready for submission: YES

Two checks by design

The agent fixes your data locally; nemar-cli validates again at the upload gate. This is Week 9's equivalent of cite-the-card and validate_fonts.py: a deterministic gate that turns "looks fine" into pass/fail.
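What a deterministic gate looks like can be shown with the two findings from the sample report above. This sketch is far narrower than the BIDS validator and does not replace it. It only checks that dataset_description.json exists and that every `_eeg.json` sidecar has PowerLineFrequency. The function name and the test files are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def check_dataset(root: Path) -> list:
    """Return a list of problems; an empty list means the gate passes."""
    problems = []
    if not (root / "dataset_description.json").is_file():
        problems.append("missing dataset_description.json")
    for sidecar in sorted(root.rglob("*_eeg.json")):
        metadata = json.loads(sidecar.read_text())
        if "PowerLineFrequency" not in metadata:
            problems.append(f"{sidecar.name} missing PowerLineFrequency")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    eeg_dir = root / "sub-01" / "eeg"
    eeg_dir.mkdir(parents=True)
    sidecar = eeg_dir / "sub-01_task-surroundSupp_eeg.json"
    sidecar.write_text(json.dumps({"SamplingFrequency": 500}))
    before = check_dataset(root)  # both problems present

    # Apply the two fixes from the report, then re-validate.
    (root / "dataset_description.json").write_text(
        json.dumps({"Name": "Example", "BIDSVersion": "1.9.0"})
    )
    sidecar.write_text(json.dumps({"SamplingFrequency": 500, "PowerLineFrequency": 60}))
    after = check_dataset(root)

print("before:", before)
print("after:", after)
```

The same input always produces the same verdict, which is what separates a gate from a glance.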

10. OpenNeuro and NEMAR

OpenNeuro is the de-facto open BIDS archive: validated on ingest, public, DOI-minted.

Honest caveats

Private upload is possible on OpenNeuro, but only via the command line / direct push (no polished GUI), and the DOI record stays sparse: no ORCID author links, minimal metadata.

NEMAR specializes in EEG/MEG BIDS datasets and sits next to San Diego Supercomputer Center compute, so you can analyze without downloading. HBN-EEG lives on both.

11. nemar-cli: Validation, Upload, Publish

Validation is one command, and also runs automatically on upload and on every update pull request:

nemar dataset validate ./my-dataset

The full path, and the collaboration model:

nemar auth login                          # one-time, API key cached
nemar dataset validate ./my-dataset       # BIDS check, must pass
nemar dataset upload ./my-dataset         # creates a private GitHub repo
nemar dataset publish request nm000XXX    # admin approves -> public + DOI

upload creates a private GitHub repository where you are the admin: invite collaborators and push directly while you stage. After publishing, changes go through pull requests and version tags. (OpenNeuro also supports private upload, just command-line only, so the NEMAR advantage is this collaboration model plus the rich DOI metadata, not "private vs public.")
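If you script these steps, the "must pass" rule can be enforced in code. The sketch below chains only the validate and upload commands shown above and stops at the first failure. It assumes that nemar-cli signals failure with a non-zero exit code, a common CLI convention that is not confirmed here. The helper names are hypothetical, and the demonstration uses a fake runner so nothing is actually executed.

```python
import subprocess
from types import SimpleNamespace


def release_commands(dataset_dir):
    """The two commands from the listing above, in order."""
    return [
        ["nemar", "dataset", "validate", dataset_dir],
        ["nemar", "dataset", "upload", dataset_dir],
    ]


def run_pipeline(dataset_dir, run=subprocess.run):
    """Run validate, then upload; stop at the first step that fails."""
    completed = []
    for command in release_commands(dataset_dir):
        result = run(command)
        if result.returncode != 0:  # assumption: non-zero exit means failure
            break
        completed.append(command[2])
    return completed


def fake_run_ok(command):
    """Stand-in runner where every step succeeds."""
    return SimpleNamespace(returncode=0)


def fake_run_bad_validation(command):
    """Stand-in runner where validation fails."""
    return SimpleNamespace(returncode=1 if command[2] == "validate" else 0)


print(run_pipeline("./my-dataset", run=fake_run_ok))              # ['validate', 'upload']
print(run_pipeline("./my-dataset", run=fake_run_bad_validation))  # []
```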

12. DOI Minting + ORCID Auto-link

DOI minting and ORCID auto-linking: authors to ORCID iDs to DataCite DOI to each author's ORCID record

On publish, nemar-cli mints a concept DOI (one stable citation across all versions) plus per-version DOIs. Minting goes through EZID, which writes DataCite kernel-4 metadata (the DOIs carry the 10.82901/NEMAR.<id> prefix), and every author's ORCID iD is linked automatically. The dataset then appears on each author's ORCID record without further action. OpenNeuro does not link authors to ORCID on the DOI yet.
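For a sense of what "linked to ORCID" means inside the metadata, here is a sketch of one DataCite creator entry carrying an ORCID name identifier, plus the ISO 7064 MOD 11-2 checksum that every ORCID iD carries. This is not nemar-cli's code. The field names follow the DataCite JSON format as I understand it, and the iD is ORCID's public sample record for the fictitious Josiah Carberry, not a course author.

```python
def orcid_check_digit(base_digits: str) -> str:
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an ORCID iD."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, digits, and checksum of a hyphenated ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_digit(compact[:15]) == compact[15].upper()


def datacite_creator(name: str, orcid: str) -> dict:
    """Build a DataCite-style creator entry with an ORCID name identifier."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"invalid ORCID iD: {orcid}")
    return {
        "name": name,
        "nameType": "Personal",
        "nameIdentifiers": [
            {
                "nameIdentifier": f"https://orcid.org/{orcid}",
                "nameIdentifierScheme": "ORCID",
                "schemeUri": "https://orcid.org",
            }
        ],
    }


creator = datacite_creator("Carberry, Josiah", "0000-0002-1825-0097")
print(creator["nameIdentifiers"][0]["nameIdentifier"])
```

Without the `nameIdentifiers` block, the DOI record holds only a name string, and there is nothing for ORCID to link to.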

13. The Metadata Gap: Proof on a Real Dataset

Live DataCite data, the same HBN-EEG Release 1, the same eight authors, two homes.

DataCite metadata comparison: NEMAR nm000103 vs OpenNeuro ds005505, grouped into findability and credit fields

DataCite field	NEMAR `nm000103`	OpenNeuro `ds005505`
Stable concept DOI	yes	no (version-only)
Authors linked to ORCID iD	8 / 8	0 / 8
License	CC-BY-NC-SA-4.0	none
Subject keywords	8	0
Description / abstract	yes	none
Links to papers + related datasets	5	0
Funding references	2	0

Findability and credit are metadata, not luck

OpenNeuro's DOI record carries only a title and author names; NEMAR fills every field. (Source: api.datacite.org, live records.)
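You can run the same comparison on any DOI. The sketch below counts the fields from the table in a DataCite attributes dictionary, which, to my understanding, is what the DataCite REST API returns under `data.attributes` for a DOI. It does not fetch anything. The sample record is hypothetical and deliberately sparse, and the function name is my own.

```python
def summarize(attributes: dict) -> dict:
    """Count the findability and credit fields of one DataCite record."""
    creators = attributes.get("creators", [])
    with_orcid = sum(
        1
        for creator in creators
        if any(
            identifier.get("nameIdentifierScheme") == "ORCID"
            for identifier in creator.get("nameIdentifiers", [])
        )
    )
    return {
        "authors_with_orcid": f"{with_orcid} / {len(creators)}",
        "licenses": len(attributes.get("rightsList", [])),
        "subjects": len(attributes.get("subjects", [])),
        "descriptions": len(attributes.get("descriptions", [])),
        "related_identifiers": len(attributes.get("relatedIdentifiers", [])),
        "funding_references": len(attributes.get("fundingReferences", [])),
    }


# Hypothetical sparse record: a title and author names, nothing else.
sparse_record = {
    "titles": [{"title": "Example dataset"}],
    "creators": [{"name": "Author One"}, {"name": "Author Two"}],
}

summary = summarize(sparse_record)
for field, value in summary.items():
    print(f"{field}: {value}")
```

Feeding it the attributes of two real records reproduces the kind of side-by-side shown in the table.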

14. Live Walkthrough

Two small, honest actions, about four minutes total:

HEDit -- write one rich prose description of an HBN event and watch it become a validated HED string.
nemar dataset validate -- run it on the HBN practicum dataset and read the clean BIDS report.

We do not manufacture a pass

If validation surfaces something, we walk through it on stage.

15. Before Next Week

Install research-skills; it bundles neuroinformatics, figures, manuscript, opencite, grant, project, and presentation.
If you have a small EEG/EMG dataset, try /neuroinformatics:bids-conversion and run the bids-validator agent.
Browse HBN-EEG on NEMAR and OpenNeuro; compare the two DOI records on DataCite Commons.
Optional: try HEDit on one event from your own experiment, written as a rich paragraph.

Week 10 is the capstone: building your own plugin.

Resources

Course materials

Week 9 session
Week 9 blog (markdown source)
Course repository
research-skills plugin (bundles neuroinformatics, figures, manuscript, opencite, grant, project, presentation)

Standards and tools

BIDS specification (Brain Imaging Data Structure)
HED (Hierarchical Event Descriptors)
HEDit (natural language to validated HED)
nemar-cli (upload, validate, version, and publish to NEMAR)
OpenNeuro and NEMAR (open BIDS archives)
DataCite Commons (inspect DOI metadata)
ORCID (researcher and contributor identifiers)

Reference

Shirazi et al. (2024), HBN-EEG: The FAIR implementation of the Healthy Brain Network EEG dataset, bioRxiv 10.1101/2024.10.03.615261
Open Science Collective Discord