Data Dictionary
A data dictionary or a data codebook is a document directly describing data elements and variables. It is an essential component in facilitating FAIR research data management by helping others to better understand and potentially reuse the datasets. It is typically published alongside the research data in the form of a README file, which describes the files, their organization, and provides an overview of the methodologies used for data collection.
Data dictionaries and codebooks describe data variables and structures, including variable names, descriptions, value ranges, and units. These elements prevent misinterpretation of data in different cultural contexts from the original authors and help others better understand the datasets for future reuse.
What should be in a data dictionary
A good data dictionary or codebook should provide sufficient information about each variable to be self-explanatory and easily interpreted by others. There are various schools of thought about the essential attributes in a dictionary, such as OSF: How to Make a Data Dictionary, University of Pennsylvania Libraries Data Management Resources on Codebooks & Data Dictionaries. We find that the resource by UBC Library Research Commons includes an adequate set of “must-have” attributes to include in a data dictionary for tabular datasets.
- Column (variable) name: machine-readable name of the variable labelled in the dataset
- Column name in plain English: human-readable name of the variable
- Description of the Column: a brief description of what the variable represents
- Data type: type of data stored in the column (e.g., integer, string, datetime)
- Data usage type: inteneded use or role of data column
- Sample values: examples of values that might be found in the column
Evaluating and Improving Quality of Data Dictionary
Data documentation is essential for the reproducibility, reuse and understanding of research data. A data dictionary, which describes elements and variables in a human-readable format, is an important aspect of data documentation. However, even with a data dictionary, poor documentation quality can still hinder the ability of other researchers to understand and re-use these data in the future.
Even with a data dictionary that includes essential attributes and more, one might still question whether the data dictionary is easy to understand and if the variable descriptions are clear enough for others or the descriptions of the data type are concise enough. In other words, is there an objective approach to evaluate the quality of the dictionary?
With the emergence of generative artificial intelligence and large language models, machines are now capable of interpreting and responding to information in natural language with high accuracy and confidence. These capabilities can be leveraged to evaluate data dictionaries by interpreting the provided information and generating new insights or content.
Let’s look at an example.
Montocchio, Danielle, 2025, “Data for “Effects of two decades of atypical water levels on fish abundance, diversity, and species composition in Georgian Bay coastal wetlands””, https://doi.org/10.5683/SP3/0SJGRA, Borealis, V1
The repository of this dataset includes a README file that describes the data files and their organization, as well as a simple data dictionary with variable labels and descriptions.
3.1 FOR: [fish_data.csv]
Number of variables: 18
Number of cases/rows: 13883
Variable List:
site - Unique 3-4 digit site ID code.
name - Wetland site name.
subreg - Assigned sub-region for a given wetland site
lake - Lake a given wetland site is connected to.
date - Date of data collection (dd/mm/yyyy).
yr - Year of data collection. Additional years included that were not used in final study.
lat - General site latitude coordinate (decimal degrees).
long - General site longitude coordinate (decimal degrees).
sp - Unique 4 digit species code. All species included, as well as those that could not be identified past genera (i.e. code ends in SP).
sciname - Scientific name of fish species.
comname - Common name of fish species.
len - Total length of individual fish (mm).
xfish - Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007). Value is for individuals caught in the fyke nets of a single site for that specific date.
xfish_len - Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals).Value is for individuals caught in the fyke nets of a single site for that specific date.
spabun - Species abundance. Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date.
fam - Species family.
funcfeed - Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet.
netsize - Size of fyke net. SF = small; LF = large fyke. (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available **Note: not completely populated (data missing).**
Let’s pass this data dictionary to a large language model (LLM), such as ChatGPT, using the following prompt:
Can you please generate 5 sample data using the following data dictionary list,
3.1 FOR: [fish_data.csv]
Number of variables: 18
Number of cases/rows: 13883
Variable List:
site - Unique 3-4 digit site ID code.
name - Wetland site name.
subreg - Assigned sub-region for a given wetland site
lake - Lake a given wetland site is connected to.
date - Date of data collection (dd/mm/yyyy).
yr - Year of data collection. Additional years included that were not used in final study.
lat - General site latitude coordinate (decimal degrees).
long - General site longitude coordinate (decimal degrees).
sp - Unique 4 digit species code. All species included, as well as those that could not be identified past genera (i.e. code ends in SP).
sciname - Scientific name of fish species.
comname - Common name of fish species.
len - Total length of individual fish (mm).
xfish - Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007). Value is for individuals caught in the fyke nets of a single site for that specific date.
xfish_len - Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals).Value is for individuals caught in the fyke nets of a single site for that specific date.
spabun - Species abundance. Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date.
fam - Species family.
funcfeed - Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet.
netsize - Size of fyke net. SF = small; LF = large fyke. (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available **Note: not completely populated (data missing).**
And observe the resulting response.
Here are five sample rows of data for your **fish_data.csv** based on the data dictionary you provided:
| site | name | subreg | lake | date | yr | lat | long | sp | sciname | comname | len | xfish | xfish_len | spabun | fam | funcfeed | netsize |
|-------|------------|--------|----------|------------|------|-----------|-----------|------|------------------|--------------|------|-------|-----------|--------|----------------|------------|---------|
| 1023 | Marsh Bay | NW | Lake Huron | 15/06/2023 | 2023 | 45.1234 | -81.5678 | 0301 | *Micropterus salmoides* | Largemouth Bass | 305 | 18 | 312 | 18 | Centrarchidae | Piscivore | LF |
| 2045 | Willow Wetland | SE | Lake Erie | 07/07/2023 | 2023 | 42.6589 | -80.1234 | 0412 | *Perca flavescens* | Yellow Perch | 145 | 22 | 150 | 22 | Percidae | Invertivore| SF |
| 3087 | Cedar Marsh | NW | Lake Ontario | 20/08/2023 | 2023 | 43.2034 | -79.4567 | 0156 | *Esox lucius* | Northern Pike | 420 | 25 | 430 | 25 | Esocidae | Piscivore | LF |
| 4120 | Bay Wetland | NE | Lake Superior | 10/09/2023 | 2023 | 47.9001 | -87.2345 | 0250 | *Lepomis macrochirus* | Bluegill | 130 | 16 | 135 | 16 | Centrarchidae | Invertivore| SF |
| 5032 | Meadow Wetland | SW | Lake Michigan | 05/06/2023 | 2023 | 44.6789 | -85.9876 | 0110 | *Amia calva* | Bowfin | 610 | 10 | 600 | 10 | Amiidae | Piscivore | LF |
### Explanation:
- **site:** Unique site ID ranging from 3 to 4 digits.
- **name:** Wetland site name.
- **subreg:** Sub-region identifier (NW, SE, NE, SW).
- **lake:** One of the Great Lakes.
- **date/yr:** Dates set within 2023 for consistency with recent data.
- **lat/long:** Approximate decimal degrees for various wetland sites.
- **sp:** Unique species code (randomly generated 4-digit number).
- **sciname/comname:** Scientific and common names of fish species common to North America.
- **len:** Total length of individual fish in mm.
- **xfish/xfish_len:** Counts and mean lengths of extra individuals.
- **spabun:** Species abundance, matching xfish count.
- **fam:** Species family.
- **funcfeed:** Functional feeding group based on diet (Piscivore, Invertivore).
- **netsize:** Fyke net size (LF = Large Fyke; SF = Small Fyke).
Let me know if you want this in CSV format!
site | name | subreg | lake | date | yr | lat | long | sp | sciname | comname | len | xfish | xfish_len | spabun | fam | funcfeed | netsize |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1023 | Marsh Bay | NW | Lake Huron | 15/06/2023 | 2023 | 45.1234 | -81.5678 | 0301 | Micropterus salmoides | Largemouth Bass | 305 | 18 | 312 | 18 | Centrarchidae | Piscivore | LF |
2045 | Willow Wetland | SE | Lake Erie | 07/07/2023 | 2023 | 42.6589 | -80.1234 | 0412 | Perca flavescens | Yellow Perch | 145 | 22 | 150 | 22 | Percidae | Invertivore | SF |
3087 | Cedar Marsh | NW | Lake Ontario | 20/08/2023 | 2023 | 43.2034 | -79.4567 | 0156 | Esox lucius | Northern Pike | 420 | 25 | 430 | 25 | Esocidae | Piscivore | LF |
4120 | Bay Wetland | NE | Lake Superior | 10/09/2023 | 2023 | 47.9001 | -87.2345 | 0250 | Lepomis macrochirus | Bluegill | 130 | 16 | 135 | 16 | Centrarchidae | Invertivore | SF |
5032 | Meadow Wetland | SW | Lake Michigan | 05/06/2023 | 2023 | 44.6789 | -85.9876 | 0110 | Amia calva | Bowfin | 610 | 10 | 600 | 10 | Amiidae | Piscivore | LF |
This generated data sample looks quite indistinguishable with data from the actual dataset. However, there are still some discrepancies between the actual data and the generated data.
site
/sp
: Faulty generated data based on incorrect variable description providedsubreg
: The model took assumption to interpret this field to be the bearing of the wetland land site based on the variable description providedlat
/long
: Both fields allow floating point numbers to 5 decimal placesxfish
/xfish_len
: Both fields can include null values and they are floating point numbersnetsize
: Net size field can include null values
site | name | subreg | lake | date | yr | lat | long | sp | sciname | comnam | len | xfish | xfish_len | spabun | fam | funcfeed | netsize |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MER | Mermaid Island | Honey Harbour | Georgian Bay | 19/07/2021 | 2021 | 44.879080 | -79.817430 | LEGI | Lepomis gibbonus | pumpkinseed | nan | 21.000000 | 78.130000 | 74 | Centrarchidae | Carnivore | nan |
MER | Mermaid Island | Honey Harbour | Georgian Bay | 19/07/2021 | 2021 | 44.879080 | -79.817430 | LEMA | Lepomis macrochirus | bluegill | nan | 60.000000 | 103.890000 | 141 | Centrarchidae | Carnivore | nan |
MER | Mermaid Island | Honey Harbour | Georgian Bay | 19/07/2021 | 2021 | 44.879080 | -79.817430 | LEMA | Lepomis macrochirus | bluegill | nan | 45.000000 | 80.000000 | 141 | Centrarchidae | Carnivore | nan |
MER | Mermaid Island | Honey Harbour | Georgian Bay | 19/07/2021 | 2021 | 44.879080 | -79.817430 | LEGI | Lepomis gibbonus | pumpkinseed | nan | 23.000000 | 88.000000 | 74 | Centrarchidae | Carnivore | nan |
HG | Hog Bay | Severn Sound | Georgian Bay | 08/07/2015 | 2015 | 44.731340 | -79.802860 | LEGI | Lepomis gibbonus | pumpkinseed | nan | 62.000000 | 107.130000 | 236 | Centrarchidae | Carnivore | nan |
CM | Cormican Bay | Franklin Island | Georgian Bay | 12/08/2003 | 2003 | 45.405881 | -80.305820 | LEGI | Lepomis gibbonus | pumpkinseed | 97.000000 | nan | nan | 84 | Centrarchidae | Carnivore | SF |
OJ | Ojibway Bay | Beausoleil Island | Georgian Bay | 24/06/2009 | 2009 | 44.887780 | -79.855730 | PEFL | Perca flavescens | yellow perch | 92.000000 | nan | nan | 8 | Percidae | Carnivore | nan |
MNC | Miner's Creek | Tadenac Bay | Georgian Bay | 25/07/2006 | 2006 | 45.060394 | -79.951280 | LEGI | Lepomis gibbonus | pumpkinseed | 105.000000 | nan | nan | 61 | Centrarchidae | Carnivore | LF |
HW | Hole in the Wall | Pointe au Baril | Georgian Bay | 01/08/2007 | 2007 | 45.521920 | -80.438590 | PINO | Pimephales notatus | bluntnose minnow | 46.000000 | nan | nan | 3 | Cyprinidae | Benthivore | nan |
KE | Key River | Key River | Georgian Bay | 10/08/2016 | 2016 | 45.885510 | -80.676170 | LEGI | Lepomis gibbonus | pumpkinseed | nan | 8.000000 | 125.400000 | 76 | Centrarchidae | Carnivore | nan |
Let’s create a data dictionary table with additional fields: “Meaningful Name,” “Data Type,” and “Data Usage Type.” Then, pass it to the model again to observe any changes in the quality of the generated data.
Column Name | Meaningful Name | Description | Data Type | Data Usage Type |
---|---|---|---|---|
site | Site ID | Unique 3-4 digit site ID code. | String | Attribute |
name | Site name | Wetland site name. | String | Dimension Attribute |
subreg | Sub-region site name | Assigned sub-region for a given wetland site | String | Dimension Attribute |
lake | Name of lake | Lake a given wetland site is connected to. | String | Dimension Attribute |
date | Date | Date of data collection (dd/mm/yyyy). | Date | Attribute |
yr | Year | Year of data collection. Additional years included that were not used in final study. | Number | Attribute |
lat | Latitude | General site latitude coordinate (decimal degrees). | Number | Attribute |
long | Longitude | General site longitude coordinate (decimal degrees). | Number | Attribute |
len | Length | Total length of individual fish (mm). | Number | Fact |
sp | Species | Unique 4 digit species code. All species included, as well as those that could not be identified past genera (i.e. code ends in SP). | String | Dimension Foreign Key |
sciname | Scientific name | Scientific name of fish species. | String | Dimension Attribute |
comname | Common name | Common name of fish species. | String | Dimension Attribute |
xfish | Count of individuals of a specific fish species | Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007). Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact |
xfish_len | Average length of individuals of a specific fish species | Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals).Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact |
spabun | Species abundance | Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date. | Number | Fact |
fam | Family | Species family. | String | Dimension Attribute |
funcfeed | Functional feeding group | Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. | String | Dimension Attribute |
netsize | Size of fyke net | Size of fyke net. SF = small; LF = large fyke. (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available. Note: not completely populated (data missing). | String | Dimension Attribute |
Prompt:
Can you please generate 5 sample data using the following data dictionary table,
| Column Name | Meaningful Name | Description | Data Type | Data Usage Type |
| ----------- | --------------- | ----------- | --------- | --------------- |
| `site` | Site ID | Unique 3-4 digit site ID code. | String | Attribute |
| `name` | Site name | Wetland site name. | String | Dimension Attribute |
| `subreg` | Sub-region site name | Assigned sub-region for a given wetland site | String | Dimension Attribute |
| `lake` | Name of lake | Lake a given wetland site is connected to. | String | Dimension Attribute |
| `date` | Date | Date of data collection (dd/mm/yyyy). | Date | Attribute |
| `yr` | Year | Year of data collection. Additional years included that were not used in final study. | Number | Attribute |
| `lat` | Latitude | General site latitude coordinate (decimal degrees). | Number | Attribute |
| `long` | Longitude | General site longitude coordinate (decimal degrees). | Number | Attribute |
| `len` | Length | Total length of individual fish (mm). | Number | Fact |
| `sp` | Species | Unique 4 digit species code. All species included, as well as those that could not be identified past genera (i.e. code ends in SP). | String | Dimension Foreign Key |
| `sciname` | Scientific name | Scientific name of fish species. | String | Dimension Attribute |
| `comname` | Common name | Common name of fish species. | String | Dimension Attribute |
| `xfish` | Count of individuals of a specific fish species | Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007). Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact |
| `xfish_len` | Average length of individuals of a specific fish species | Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals).Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact |
| `spabun` | Species abundance | Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date. | Number | Fact |
| `fam` | Family | Species family. | String | Dimension Attribute |
| `funcfeed` | Functional feeding group | Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. | String | Dimension Attribute |
| `netsize` | Size of fyke net | Size of fyke net. SF = small; LF = large fyke. (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available. Note: not completely populated (data missing). | String | Dimension Attribute |
Response:
Here are five sample data entries based on the data dictionary table:
| site | name | subreg | lake | date | yr | lat | long | len | sp | sciname | comname | xfish | xfish_len | spabun | fam | funcfeed | netsize |
|------|----------------|--------|--------------|------------|------|---------|---------|-----|------|---------------------|-----------------|-------|-----------|--------|----------------|-------------|---------|
| 1012 | Green Marsh | NW | Lake Huron | 12/06/2023 | 2023 | 45.1234 | -81.4567| 285 | 0305 | *Micropterus dolomieu* | Smallmouth Bass | 20 | 290 | 20 | Centrarchidae | Piscivore | LF |
| 2048 | Willow Creek | NE | Lake Ontario | 20/07/2023 | 2023 | 44.5678 | -79.1234| 150 | 0418 | *Perca flavescens* | Yellow Perch | 18 | 155 | 18 | Percidae | Invertivore | SF |
| 3092 | Cedar Wetland | SE | Lake Erie | 01/08/2023 | 2023 | 42.2345 | -80.5678| 400 | 0150 | *Esox lucius* | Northern Pike | 12 | 410 | 12 | Esocidae | Piscivore | LF |
| 4105 | Blue Bay | SW | Lake Superior| 15/09/2023 | 2023 | 47.3456 | -87.2345| 120 | 0254 | *Lepomis gibbosus* | Pumpkinseed | 25 | 125 | 25 | Centrarchidae | Invertivore | SF |
| 5021 | Meadow Lagoon | NW | Lake Michigan| 05/06/2023 | 2023 | 44.7890 | -85.6789| 600 | 0112 | *Amia calva* | Bowfin | 8 | 590 | 8 | Amiidae | Piscivore | LF |
---
### Explanation:
- **site, name, subreg, lake, date, yr**: Provide site-specific details and the date/year of data collection.
- **lat, long**: Latitude and longitude in decimal degrees format.
- **len**: Total length of individual fish in millimeters.
- **sp, sciname, comname**: Unique species code, scientific name, and common name of the fish species.
- **xfish, xfish_len, spabun**: Number and mean length of extra individuals and species abundance count.
- **fam**: Family name of the species.
- **funcfeed**: Functional feeding group based on diet and age (Piscivore, Invertivore).
- **netsize**: Size of fyke net used (LF = large fyke, SF = small fyke).
Let me know if you want this in CSV format or adjusted further!
site | name | subreg | lake | date | yr | lat | long | len | sp | sciname | comname | xfish | xfish_len | spabun | fam | funcfeed | netsize |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1012 | Green Marsh | NW | Lake Huron | 12/06/2023 | 2023 | 45.1234 | -81.4567 | 285 | 0305 | Micropterus dolomieu | Smallmouth Bass | 20 | 290 | 20 | Centrarchidae | Piscivore | LF |
2048 | Willow Creek | NE | Lake Ontario | 20/07/2023 | 2023 | 44.5678 | -79.1234 | 150 | 0418 | Perca flavescens | Yellow Perch | 18 | 155 | 18 | Percidae | Invertivore | SF |
3092 | Cedar Wetland | SE | Lake Erie | 01/08/2023 | 2023 | 42.2345 | -80.5678 | 400 | 0150 | Esox lucius | Northern Pike | 12 | 410 | 12 | Esocidae | Piscivore | LF |
4105 | Blue Bay | SW | Lake Superior | 15/09/2023 | 2023 | 47.3456 | -87.2345 | 120 | 0254 | Lepomis gibbosus | Pumpkinseed | 25 | 125 | 25 | Centrarchidae | Invertivore | SF |
5021 | Meadow Lagoon | NW | Lake Michigan | 05/06/2023 | 2023 | 44.7890 | -85.6789 | 600 | 0112 | Amia calva | Bowfin | 8 | 590 | 8 | Amiidae | Piscivore | LF |
The new data sample does not show significant improvement over the previous one. The discrepancies found in the previous generation still persist.
What if we add the final field, “Sample Values,” and update the descriptions of some variables? In this latest data dictionary, we corrected the descriptions for site
and sp
, added the degree of accuracy for lat
and long
, cited the larval fish special case for sciname
, and provided value ranges for the quantitative data fields.
Column Name | Meaningful Name | Description | Data Type | Data Usage Type | Sample Values |
---|---|---|---|---|---|
site | Site ID | Unique | String | Attribute | e.g., ‘MER’, ‘HG’, ‘STS’, ‘QI’, ‘LY1’, ‘OJ’, ‘NB’, ‘GY’, ‘LSP’, ‘RS’, ‘TD2’, ‘TD1’, ‘BLR’, ‘DV’, ‘MNC’, ‘SHR’, ‘HRM’, ‘GC’, ‘HB2’, ‘HB3’, ‘HB1’, ‘LACin’, ‘DF’, ‘IQ’, ‘DH’ |
name | Site name | Wetland site name. | String | Dimension Attribute | e.g., ‘Mermaid Island’, ‘Hog Bay’, ‘Sturgeon Bay South’, ‘Quarry Island’, ‘Lily Pond 1’, ‘Ojibway Bay’, ‘North Bay’, ‘Ganyon Bay’, ‘Lake St. Patrick’, ‘Roseborough’, ‘Tadenac Bay 2’, ‘Tadenac Bay 1’, ‘Black Rock’, ‘David’s Bay’, ‘Miner’s Creek’, ‘Shawanaga River’, ‘Hermann’s Bay’, ‘Garden Channel’, ‘Hay Bay 2’, ‘Hay Bay 3’, ‘Hay Bay 1’, ‘La Cloche (inner)’, ‘Dogfish Bay’, ‘Iroquois Bay’, ‘Dead Horse Bay’ |
subreg | Sub-region site name | Assigned sub-region for a given wetland site | String | Dimension Attribute | e.g., ‘Honey Harbour’, ‘Severn Sound’, ‘Beausoleil Island’, ‘North Bay’, ‘Cognashene’, ‘Go Home Bay’, ‘Tadenac Bay’, ‘Wah Wah Tay See’, ‘Pointe au Baril’, ‘Twelve Mile Bay’, ‘Sans Souci’, ‘Fathom Five’, ‘Franklin Island’, ‘Naiscoot’, ‘Key River’, ‘Killarney’, ‘French River’, ‘Bay of Islands’, ‘McGregor Bay’ |
lake | Name of lake | Lake a given wetland site is connected to. | String | Dimension Attribute | e.g., ‘Georgian Bay’, ‘Huron’ |
date | Date | Date of data collection (dd/mm/yyyy), ranging from 10/07/2003 to 19/07/2021. | Date | Attribute | e.g., ‘19/07/2021’, ‘08/07/2015’, ‘08/06/2003’, ‘27/05/2004’ |
yr | Year | Year of data collection, ranging from 2003 to 2021. Additional years included that were not used in final study. | Number | Attribute | e.g., 2021, 2015, 2003, 2004 |
lat | Latitude | General site latitude coordinate ( | Number | Attribute | e.g., 44.87908, 44.73134, 44.73148 |
long | Longitude | General site longitude coordinate ( | Number | Attribute | e.g., -79.81743, -79.80286, -79.80449 |
len | Length | Total length of individual fish (mm), ranging from 4mm to 1180mm. | Number | Fact | e.g., null, 115, 76, 98 |
sp | Species | Unique 4 | String | Dimension Foreign Key | e.g., ‘LEGI’, ‘LEMA’, ‘ICPU’, ‘COBA’, ‘LARV’ |
sciname | Scientific name | Scientific name of fish species. This name is null if sp is ‘LARV’ or/and comname is ‘Larval fish’ | String | Dimension Attribute | e.g., ‘Lepomis gibbonus’, ‘Lepomis macrochirus’, ‘Icalurus punctatus’, ‘Cottus bairdi’, ‘null’ |
comname | Common name | Common name of fish species. | String | Dimension Attribute | e.g., ‘pumpkinseed’, ‘bluegill’, ‘channel catfish’, ‘mottled sculpin’, ‘Larval fish’ |
xfish | Count of individuals of a specific fish species | Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007), ranging from 0 to 801. Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact | e.g., null, 21, 60, 45 |
xfish_len | Average length of individuals of a specific fish species | Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals), ranging from 18mm to 289.875mm. Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact | e.g., null, 78.13, 103.89, 80, 88, 107.13 |
spabun | Species abundance | Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date, ranging from 1 to 801. | Number | Fact | e.g., 74, 141, 2, 7, 4, 1 |
fam | Family | Species family. | String | Dimension Attribute | e.g., ‘Centrarchidae’, ‘Amiidae’, ‘Percidae’, ‘Gobiidae’, ‘Ictaluridae’, ‘Cyprinidae’, ‘Lepisosteidae’, ‘Gasterosteidae’, ‘Cyprinodontidae’, ‘Esocidae’, ‘Clupeidae’, ‘Catostomidae’, ‘Atherinidae’, ‘Umbridae’, ‘Sciaenidae’, ‘Cottidae’ |
funcfeed | Functional feeding group | Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. | String | Dimension Attribute | e.g., ‘Carnivore’, ‘Piscivore’, ‘Benthivore’, ‘Omnivore’, ‘Planktivore’, ‘null’ |
netsize | Size of fyke net | Size of fyke net, e.g. small (SF), large (LF). | String | Dimension Attribute | e.g., ‘null’, ‘LF’, ‘SF’, ‘LF MNR’, ‘SF MNR’ |
PROMPT:
Can you please generate 10 sample data using the following data dictionary table,
| Column Name | Meaningful Name | Description | Data Type | Data Usage Type | Sample Values |
| ----------- | --------------- | ----------- | --------- | --------------- | ------------- |
| `site` | Site ID | Unique 2-5 alphanumeric site ID code of abbreviated site name. | String | Attribute | e.g., 'MER', 'HG', 'STS', 'QI', 'LY1', 'OJ', 'NB', 'GY', 'LSP', 'RS', 'TD2', 'TD1', 'BLR', 'DV', 'MNC', 'SHR', 'HRM', 'GC', 'HB2', 'HB3', 'HB1', 'LACin', 'DF', 'IQ', 'DH' |
| `name` | Site name | Wetland site name. | String | Dimension Attribute | e.g., 'Mermaid Island', 'Hog Bay', 'Sturgeon Bay South', 'Quarry Island', 'Lily Pond 1', 'Ojibway Bay', 'North Bay', 'Ganyon Bay', 'Lake St. Patrick', 'Roseborough', 'Tadenac Bay 2', 'Tadenac Bay 1', 'Black Rock', 'David's Bay', 'Miner's Creek', 'Shawanaga River', 'Hermann's Bay', 'Garden Channel', 'Hay Bay 2', 'Hay Bay 3', 'Hay Bay 1', 'La Cloche (inner)', 'Dogfish Bay', 'Iroquois Bay', 'Dead Horse Bay' |
| `subreg` | Sub-region site name | Assigned sub-region for a given wetland site | String | Dimension Attribute | e.g., 'Honey Harbour', 'Severn Sound', 'Beausoleil Island', 'North Bay', 'Cognashene', 'Go Home Bay', 'Tadenac Bay', 'Wah Wah Tay See', 'Pointe au Baril', 'Twelve Mile Bay', 'Sans Souci', 'Fathom Five', 'Franklin Island', 'Naiscoot', 'Key River', 'Killarney', 'French River', 'Bay of Islands', 'McGregor Bay' |
| `lake` | Name of lake | Lake a given wetland site is connected to. | String | Dimension Attribute | e.g., 'Georgian Bay', 'Huron' |
| `date` | Date | Date of data collection (dd/mm/yyyy), ranging from 10/07/2003 to 19/07/2021. | Date | Attribute | e.g., '19/07/2021', '08/07/2015', '08/06/2003', '27/05/2004' |
| `yr` | Year | Year of data collection, ranging from 2003 to 2021. Additional years included that were not used in final study. | Number | Attribute | e.g., 2021, 2015, 2003, 2004 |
| `lat` | Latitude | General site latitude coordinate (to the 5 decimal places). | Number | Attribute | e.g., 44.87908, 44.73134, 44.73148 |
| `long` | Longitude | General site longitude coordinate (to the 5 decimal places). | Number | Attribute | e.g., -79.81743, -79.80286, -79.80449 |
| `len` | Length | Total length of individual fish (mm), ranging from 4mm to 1180mm. | Number | Fact | e.g., null, 115, 76, 98 |
| `sp` | Species | Unique 4 alphabetic species code of abbreviated scientific species name. All species included, as well as those that could not be identified past genera (i.e. code ends in SP). | String | Dimension Foreign Key | e.g., 'LEGI', 'LEMA', 'ICPU', 'COBA', 'LARV' |
| `sciname` | Scientific name | Scientific name of fish species. This name is null if `sp` is 'LARV' or/and `comname` is 'Larval fish' | String | Dimension Attribute | e.g., 'Lepomis gibbonus', 'Lepomis macrochirus', 'Icalurus punctatus', 'Cottus bairdi', 'null' |
| `comname` | Common name | Common name of fish species. | String | Dimension Attribute | e.g., 'pumpkinseed', 'bluegill', 'channel catfish', 'mottled sculpin', 'Larval fish' |
| `xfish` | Count of individuals of a specific fish species | Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007), ranging from 0 to 801. Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact | e.g., null, 21, 60, 45 |
| `xfish_len` | Average length of individuals of a specific fish species | Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals), ranging from 18mm to 289.875mm. Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact | e.g., null, 78.13, 103.89, 80, 88, 107.13 |
| `spabun` | Species abundance | Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date, ranging from 1 to 801. | Number | Fact | e.g., 74, 141, 2, 7, 4, 1 |
| `fam` | Family | Species family. | String | Dimension Attribute | e.g., 'Centrarchidae', 'Amiidae', 'Percidae', 'Gobiidae', 'Ictaluridae', 'Cyprinidae', 'Lepisosteidae', 'Gasterosteidae', 'Cyprinodontidae', 'Esocidae', 'Clupeidae', 'Catostomidae', 'Atherinidae', 'Umbridae', 'Sciaenidae', 'Cottidae' |
| `funcfeed` | Functional feeding group | Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. | String | Dimension Attribute | e.g., 'Carnivore', 'Piscivore', 'Benthivore', 'Omnivore', 'Planktivore', 'null' |
| `netsize` | Size of fyke net | Size of fyke net, e.g. small (SF), large (LF). (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available. Note: not completely populated (data missing). | String | Dimension Attribute | e.g., 'null', 'LF', 'SF', 'LF MNR', 'SF MNR' |
Response:
Here are 10 sample data entries based on your provided data dictionary table:
| site | name | subreg | lake | date | yr | lat | long | len | sp | sciname | comname | xfish | xfish_len | spabun | fam | funcfeed | netsize |
|-------|-------------------|----------------|--------------|------------|------|----------|-----------|------|------|---------------------|----------------|-------|-----------|--------|----------------|------------|---------|
| MER | Mermaid Island | Honey Harbour | Georgian Bay | 12/07/2021 | 2021 | 44.87908 | -79.81743 | 115 | LEGI | *Lepomis gibbonus* | pumpkinseed | 21 | 103.89 | 74 | Centrarchidae | Carnivore | LF |
| HG | Hog Bay | Severn Sound | Georgian Bay | 08/06/2003 | 2003 | 44.73134 | -79.80286 | 76 | LEMA | *Lepomis macrochirus* | bluegill | 60 | 78.13 | 141 | Centrarchidae | Omnivore | SF |
| STS | Sturgeon Bay South| Beausoleil Island | Georgian Bay | 27/05/2004 | 2004 | 44.73148 | -79.80449 | 98 | ICPU | *Icalurus punctatus* | channel catfish | 45 | 107.13 | 2 | Ictaluridae | Benthivore | LF |
| QI | Quarry Island | North Bay | Huron | 10/07/2010 | 2010 | 44.65023 | -80.12345 | null | LARV | null | Larval fish | null | null | 7 | Cyprinidae | null | SF MNR |
| LY1 | Lily Pond 1 | Cognashene | Georgian Bay | 15/07/2015 | 2015 | 44.56789 | -79.56789 | 120 | COBA | *Cottus bairdi* | mottled sculpin | 18 | 80 | 4 | Cottidae | Carnivore | SF |
| OJ | Ojibway Bay | Go Home Bay | Georgian Bay | 05/06/2005 | 2005 | 44.89101 | -79.34567 | 289 | LEGI | *Lepomis gibbonus* | pumpkinseed | 801 | 289.875 | 801 | Centrarchidae | Planktivore| LF MNR |
| NB | North Bay | North Bay | Huron | 18/07/2019 | 2019 | 44.90876 | -80.34567 | 140 | LEMA | *Lepomis macrochirus* | bluegill | 25 | 88 | 25 | Centrarchidae | Omnivore | null |
| GY | Ganyon Bay | Tadenac Bay | Georgian Bay | 20/07/2007 | 2007 | 44.12345 | -80.23456 | 78 | ICPU | *Icalurus punctatus* | channel catfish | 19 | 107.13 | 19 | Ictaluridae | Benthivore | LF |
| LSP | Lake St. Patrick | Bay of Islands | Huron | 30/06/2008 | 2008 | 44.56780 | -80.12378 | null | LARV | null | Larval fish | null | null | 1 | Cyprinidae | null | SF |
| RS | Roseborough | French River | Georgian Bay | 11/07/2020 | 2020 | 44.90123 | -80.78901 | 190 | COBA | *Cottus bairdi* | mottled sculpin | 35 | 103.89 | 35 | Cottidae | Piscivore | LF |
---
### Explanation:
- **site, name, subreg, lake, date, yr**: Unique identifiers and site details with sample values based on the data dictionary.
- **lat, long**: Sample latitude and longitude values up to 5 decimal places.
- **len**: Fish length is either an integer between 4mm and 1180mm or null for missing data.
- **sp, sciname, comname**: Species code, scientific name, and common name of fish. Some entries have `null` for `sciname` and `Larval fish` as the `comname`.
- **xfish, xfish_len, spabun**: Count, average length, and abundance of fish, with null values for some fields.
- **fam, funcfeed**: Family and functional feeding group with realistic distributions (including `null` for missing data).
- **netsize**: Different sizes of fyke nets, including blanks (`null`).
Let me know if you want this in CSV format!
site | name | subreg | lake | date | yr | lat | long | len | sp | sciname | comname | xfish | xfish_len | spabun | fam | funcfeed | netsize |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MER | Mermaid Island | Honey Harbour | Georgian Bay | 12/07/2021 | 2021 | 44.87908 | -79.81743 | 115 | LEGI | Lepomis gibbonus | pumpkinseed | 21 | 103.89 | 74 | Centrarchidae | Carnivore | LF |
HG | Hog Bay | Severn Sound | Georgian Bay | 08/06/2003 | 2003 | 44.73134 | -79.80286 | 76 | LEMA | Lepomis macrochirus | bluegill | 60 | 78.13 | 141 | Centrarchidae | Omnivore | SF |
STS | Sturgeon Bay South | Beausoleil Island | Georgian Bay | 27/05/2004 | 2004 | 44.73148 | -79.80449 | 98 | ICPU | Icalurus punctatus | channel catfish | 45 | 107.13 | 2 | Ictaluridae | Benthivore | LF |
QI | Quarry Island | North Bay | Huron | 10/07/2010 | 2010 | 44.65023 | -80.12345 | null | LARV | null | Larval fish | null | null | 7 | Cyprinidae | null | SF MNR |
LY1 | Lily Pond 1 | Cognashene | Georgian Bay | 15/07/2015 | 2015 | 44.56789 | -79.56789 | 120 | COBA | Cottus bairdi | mottled sculpin | 18 | 80 | 4 | Cottidae | Carnivore | SF |
OJ | Ojibway Bay | Go Home Bay | Georgian Bay | 05/06/2005 | 2005 | 44.89101 | -79.34567 | 289 | LEGI | Lepomis gibbonus | pumpkinseed | 801 | 289.875 | 801 | Centrarchidae | Planktivore | LF MNR |
NB | North Bay | North Bay | Huron | 18/07/2019 | 2019 | 44.90876 | -80.34567 | 140 | LEMA | Lepomis macrochirus | bluegill | 25 | 88 | 25 | Centrarchidae | Omnivore | null |
GY | Ganyon Bay | Tadenac Bay | Georgian Bay | 20/07/2007 | 2007 | 44.12345 | -80.23456 | 78 | ICPU | Icalurus punctatus | channel catfish | 19 | 107.13 | 19 | Ictaluridae | Benthivore | LF |
LSP | Lake St. Patrick | Bay of Islands | Huron | 30/06/2008 | 2008 | 44.56780 | -80.12378 | null | LARV | null | Larval fish | null | null | 1 | Cyprinidae | null | SF |
RS | Roseborough | French River | Georgian Bay | 11/07/2020 | 2020 | 44.90123 | -80.78901 | 190 | COBA | Cottus bairdi | mottled sculpin | 35 | 103.89 | 35 | Cottidae | Piscivore | LF |
With the updated descriptions and the addition of the “Sample Values” column, the sample data generated by the model closely resembles data from the original dataset. The model’s outputs reflect how a human user would interpret the data using the data dictionary. This demonstrates that pulbicly available generative artificial intelligence tools utilizing large language models are valuable for evalutaing the quality of research data documentation.
Additionally, one could directly ask these generative artificial intelligence tools for improvement suggestions or request the model to provide database querying instructions (e.g. SQL statements) based on its understanding of the data’s structure and organization using the dictionary.