
Data Dictionary

A data dictionary or a data codebook is a document directly describing data elements and variables. It is an essential component in facilitating FAIR research data management by helping others to better understand and potentially reuse the datasets. It is typically published alongside the research data in the form of a README file, which describes the files, their organization, and provides an overview of the methodologies used for data collection.

Data dictionaries and codebooks describe data variables and structures, including variable names, descriptions, value ranges, and units. These elements help prevent misinterpretation by readers working in a different cultural context from the original authors, and they help others understand the datasets well enough to reuse them.

What should be in a data dictionary

A good data dictionary or codebook should provide enough information about each variable for it to be self-explanatory and easily interpreted by others. There are various schools of thought about which attributes are essential; see, for example, OSF: How to Make a Data Dictionary and the University of Pennsylvania Libraries Data Management Resources on Codebooks & Data Dictionaries. We find that the resource by UBC Library Research Commons offers an adequate set of “must-have” attributes for a data dictionary for tabular datasets.

  • Column (variable) name: machine-readable name of the variable labelled in the dataset
  • Column name in plain English: human-readable name of the variable
  • Description of the Column: a brief description of what the variable represents
  • Data type: type of data stored in the column (e.g., integer, string, datetime)
  • Data usage type: intended use or role of the data column
  • Sample values: examples of values that might be found in the column
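The attributes above could be captured in a small record type, so that incomplete entries are easy to spot. The following is only a minimal sketch: the class name `DictionaryEntry`, its field names, and the helper `missing_attributes` are hypothetical and are not taken from any of the resources cited above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DictionaryEntry:
    """One row of a data dictionary for a tabular dataset (hypothetical layout)."""
    column_name: str   # machine-readable name used in the data file
    plain_name: str    # human-readable name
    description: str   # what the variable represents
    data_type: str     # e.g. "integer", "string", "datetime"
    usage_type: str    # intended use or role of the column
    sample_values: list = field(default_factory=list)


def missing_attributes(entry: DictionaryEntry) -> list:
    """Return the names of attributes that are still empty for this entry."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


# Example entry modelled on the `len` variable of the fish dataset
# discussed later on this page; no sample values have been filled in yet.
length = DictionaryEntry(
    column_name="len",
    plain_name="Length",
    description="Total length of individual fish (mm).",
    data_type="Number",
    usage_type="Fact",
)
print(missing_attributes(length))  # ['sample_values']
```

A check like this only confirms that each attribute is present; it says nothing about whether the descriptions are accurate or clear.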

Evaluating and Improving Quality of Data Dictionary

Data documentation is essential for the reproducibility, reuse and understanding of research data. A data dictionary, which describes elements and variables in a human-readable format, is an important aspect of data documentation. However, even with a data dictionary, poor documentation quality can still hinder the ability of other researchers to understand and re-use these data in the future.

Even when a data dictionary includes the essential attributes and more, one might still ask whether it is easy to understand, whether the variable descriptions are clear enough for others, and whether the data type descriptions are concise enough. In other words, is there an objective approach to evaluating the quality of the dictionary?

With the emergence of generative artificial intelligence and large language models, machines are now capable of interpreting and responding to information in natural language with high accuracy and confidence. These capabilities can be leveraged to evaluate data dictionaries by interpreting the provided information and generating new insights or content.

Let’s look at an example.

Montocchio, Danielle, 2025, “Data for “Effects of two decades of atypical water levels on fish abundance, diversity, and species composition in Georgian Bay coastal wetlands””, https://doi.org/10.5683/SP3/0SJGRA, Borealis, V1

The repository of this dataset includes a README file that describes the data files and their organization, as well as a simple data dictionary with variable labels and descriptions.

3.1 FOR: [fish_data.csv] 
 
Number of variables: 18 
Number of cases/rows: 13883 

Variable List:  

    site - Unique 3-4 digit site ID code. 
    name - Wetland site name. 
    subreg - Assigned sub-region for a given wetland site 
    lake - Lake a given wetland site is connected to. 
    date - Date of data collection (dd/mm/yyyy). 
    yr - Year of data collection. Additional years included that were not used in final study. 
    lat - General site latitude coordinate (decimal degrees). 
    long - General site longitude coordinate (decimal degrees). 
    sp - Unique 4 digit species code. All species included, as well as those that could not be identified past genera (i.e. code ends in SP). 
    sciname - Scientific name of fish species. 
    comname - Common name of fish species. 
    len - Total length of individual fish (mm). 
    xfish - Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007). Value is for individuals caught in the fyke nets of a single site for that specific date. 
    xfish_len - Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals).Value is for individuals caught in the fyke nets of a single site for that specific date. 
    spabun - Species abundance. Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date. 
    fam - Species family. 
    funcfeed - Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. 
    netsize - Size of fyke net. SF = small; LF = large fyke. (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available **Note: not completely populated (data missing).** 

Let’s pass this data dictionary to a large language model (LLM), such as ChatGPT, using the following prompt:

Can you please generate 5 sample data using the following data dictionary list,

3.1 FOR: [fish_data.csv] 
 
Number of variables: 18 
Number of cases/rows: 13883 

Variable List:  

    site - Unique 3-4 digit site ID code. 
    name - Wetland site name. 
    subreg - Assigned sub-region for a given wetland site 
    lake - Lake a given wetland site is connected to. 
    date - Date of data collection (dd/mm/yyyy). 
    yr - Year of data collection. Additional years included that were not used in final study. 
    lat - General site latitude coordinate (decimal degrees). 
    long - General site longitude coordinate (decimal degrees). 
    sp - Unique 4 digit species code. All species included, as well as those that could not be identified past genera (i.e. code ends in SP). 
    sciname - Scientific name of fish species. 
    comname - Common name of fish species. 
    len - Total length of individual fish (mm). 
    xfish - Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007). Value is for individuals caught in the fyke nets of a single site for that specific date. 
    xfish_len - Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals).Value is for individuals caught in the fyke nets of a single site for that specific date. 
    spabun - Species abundance. Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date. 
    fam - Species family. 
    funcfeed - Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. 
    netsize - Size of fyke net. SF = small; LF = large fyke. (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available **Note: not completely populated (data missing).** 

And observe the resulting response.

Here are five sample rows of data for your **fish_data.csv** based on the data dictionary you provided:

| site  | name       | subreg | lake     | date       | yr   | lat       | long      | sp   | sciname           | comname      | len  | xfish | xfish_len | spabun | fam            | funcfeed   | netsize |
|-------|------------|--------|----------|------------|------|-----------|-----------|------|------------------|--------------|------|-------|-----------|--------|----------------|------------|---------|
| 1023  | Marsh Bay  | NW     | Lake Huron | 15/06/2023 | 2023 | 45.1234  | -81.5678 | 0301 | *Micropterus salmoides* | Largemouth Bass | 305  | 18    | 312       | 18     | Centrarchidae  | Piscivore  | LF      |
| 2045  | Willow Wetland | SE   | Lake Erie | 07/07/2023 | 2023 | 42.6589  | -80.1234 | 0412 | *Perca flavescens* | Yellow Perch | 145  | 22    | 150       | 22     | Percidae       | Invertivore| SF      |
| 3087  | Cedar Marsh | NW     | Lake Ontario | 20/08/2023 | 2023 | 43.2034  | -79.4567 | 0156 | *Esox lucius*     | Northern Pike | 420  | 25    | 430       | 25     | Esocidae       | Piscivore  | LF      |
| 4120  | Bay Wetland | NE     | Lake Superior | 10/09/2023 | 2023 | 47.9001  | -87.2345 | 0250 | *Lepomis macrochirus* | Bluegill       | 130  | 16    | 135       | 16     | Centrarchidae  | Invertivore| SF      |
| 5032  | Meadow Wetland | SW   | Lake Michigan | 05/06/2023 | 2023 | 44.6789  | -85.9876 | 0110 | *Amia calva*      | Bowfin        | 610  | 10    | 600       | 10     | Amiidae        | Piscivore  | LF      |

### Explanation:
- **site:** Unique site ID ranging from 3 to 4 digits.  
- **name:** Wetland site name.  
- **subreg:** Sub-region identifier (NW, SE, NE, SW).  
- **lake:** One of the Great Lakes.  
- **date/yr:** Dates set within 2023 for consistency with recent data.  
- **lat/long:** Approximate decimal degrees for various wetland sites.  
- **sp:** Unique species code (randomly generated 4-digit number).  
- **sciname/comname:** Scientific and common names of fish species common to North America.  
- **len:** Total length of individual fish in mm.  
- **xfish/xfish_len:** Counts and mean lengths of extra individuals.  
- **spabun:** Species abundance, matching xfish count.  
- **fam:** Species family.  
- **funcfeed:** Functional feeding group based on diet (Piscivore, Invertivore).  
- **netsize:** Fyke net size (LF = Large Fyke; SF = Small Fyke).  

Let me know if you want this in CSV format!  

At first glance, this generated sample looks almost indistinguishable from data in the actual dataset. However, there are still some discrepancies between the actual data and the generated data.

  • site/sp: The generated values are wrong because the variable descriptions provided are inaccurate
  • subreg: Based on the variable description provided, the model assumed that this field holds the compass bearing of the wetland site
  • lat/long: Both fields allow floating-point numbers to 5 decimal places
  • xfish/xfish_len: Both fields are floating-point numbers and can include null values
  • netsize: The net size field can include null values

For comparison, here are ten rows from the actual dataset:

| site | name | subreg | lake | date | yr | lat | long | sp | sciname | comnam | len | xfish | xfish_len | spabun | fam | funcfeed | netsize |
|------|------|--------|------|------|----|-----|------|----|---------|--------|-----|-------|-----------|--------|-----|----------|---------|
| MER | Mermaid Island | Honey Harbour | Georgian Bay | 19/07/2021 | 2021 | 44.879080 | -79.817430 | LEGI | Lepomis gibbonus | pumpkinseed | nan | 21.000000 | 78.130000 | 74 | Centrarchidae | Carnivore | nan |
| MER | Mermaid Island | Honey Harbour | Georgian Bay | 19/07/2021 | 2021 | 44.879080 | -79.817430 | LEMA | Lepomis macrochirus | bluegill | nan | 60.000000 | 103.890000 | 141 | Centrarchidae | Carnivore | nan |
| MER | Mermaid Island | Honey Harbour | Georgian Bay | 19/07/2021 | 2021 | 44.879080 | -79.817430 | LEMA | Lepomis macrochirus | bluegill | nan | 45.000000 | 80.000000 | 141 | Centrarchidae | Carnivore | nan |
| MER | Mermaid Island | Honey Harbour | Georgian Bay | 19/07/2021 | 2021 | 44.879080 | -79.817430 | LEGI | Lepomis gibbonus | pumpkinseed | nan | 23.000000 | 88.000000 | 74 | Centrarchidae | Carnivore | nan |
| HG | Hog Bay | Severn Sound | Georgian Bay | 08/07/2015 | 2015 | 44.731340 | -79.802860 | LEGI | Lepomis gibbonus | pumpkinseed | nan | 62.000000 | 107.130000 | 236 | Centrarchidae | Carnivore | nan |
| CM | Cormican Bay | Franklin Island | Georgian Bay | 12/08/2003 | 2003 | 45.405881 | -80.305820 | LEGI | Lepomis gibbonus | pumpkinseed | 97.000000 | nan | nan | 84 | Centrarchidae | Carnivore | SF |
| OJ | Ojibway Bay | Beausoleil Island | Georgian Bay | 24/06/2009 | 2009 | 44.887780 | -79.855730 | PEFL | Perca flavescens | yellow perch | 92.000000 | nan | nan | 8 | Percidae | Carnivore | nan |
| MNC | Miner's Creek | Tadenac Bay | Georgian Bay | 25/07/2006 | 2006 | 45.060394 | -79.951280 | LEGI | Lepomis gibbonus | pumpkinseed | 105.000000 | nan | nan | 61 | Centrarchidae | Carnivore | LF |
| HW | Hole in the Wall | Pointe au Baril | Georgian Bay | 01/08/2007 | 2007 | 45.521920 | -80.438590 | PINO | Pimephales notatus | bluntnose minnow | 46.000000 | nan | nan | 3 | Cyprinidae | Benthivore | nan |
| KE | Key River | Key River | Georgian Bay | 10/08/2016 | 2016 | 45.885510 | -80.676170 | LEGI | Lepomis gibbonus | pumpkinseed | nan | 8.000000 | 125.400000 | 76 | Centrarchidae | Carnivore | nan |
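
The first discrepancy noted above (the format of `site` and `sp`) can also be checked mechanically rather than by eye. The sketch below is illustrative only: the two patterns are assumptions inferred from the handful of actual values shown on this page (letter-led site codes of 2 to 5 characters, four-letter species codes), not rules published with the dataset, and `find_discrepancies` is a hypothetical helper.

```python
import re

# Assumed formats, inferred only from the actual values shown on this page.
RULES = {
    "site": re.compile(r"[A-Za-z][A-Za-z0-9]{1,4}"),  # e.g. MER, HG, MNC
    "sp": re.compile(r"[A-Z]{4}"),                    # e.g. LEGI, PEFL
}


def find_discrepancies(row: dict) -> list:
    """List the columns of a generated row whose values break an assumed rule."""
    problems = []
    for column, pattern in RULES.items():
        value = str(row.get(column, ""))
        if not pattern.fullmatch(value):
            problems.append(f"{column}: unexpected value {value!r}")
    return problems


# One row generated by the model and one row from the actual data.
generated = {"site": "1023", "sp": "0301"}
actual = {"site": "MER", "sp": "LEGI"}

print(find_discrepancies(generated))
print(find_discrepancies(actual))
```

With these assumed rules, the generated row is flagged on both columns while the actual row passes, which matches the manual comparison above.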

Let’s create a data dictionary table with additional fields: “Meaningful Name,” “Data Type,” and “Data Usage Type.” Then, pass it to the model again to observe any changes in the quality of the generated data.

| Column Name | Meaningful Name | Description | Data Type | Data Usage Type |
| ----------- | --------------- | ----------- | --------- | --------------- |
| site | Site ID | Unique 3-4 digit site ID code. | String | Attribute |
| name | Site name | Wetland site name. | String | Dimension Attribute |
| subreg | Sub-region site name | Assigned sub-region for a given wetland site | String | Dimension Attribute |
| lake | Name of lake | Lake a given wetland site is connected to. | String | Dimension Attribute |
| date | Date | Date of data collection (dd/mm/yyyy). | Date | Attribute |
| yr | Year | Year of data collection. Additional years included that were not used in final study. | Number | Attribute |
| lat | Latitude | General site latitude coordinate (decimal degrees). | Number | Attribute |
| long | Longitude | General site longitude coordinate (decimal degrees). | Number | Attribute |
| len | Length | Total length of individual fish (mm). | Number | Fact |
| sp | Species | Unique 4 digit species code. All species included, as well as those that could not be identified past genera (i.e. code ends in SP). | String | Dimension Foreign Key |
| sciname | Scientific name | Scientific name of fish species. | String | Dimension Attribute |
| comname | Common name | Common name of fish species. | String | Dimension Attribute |
| xfish | Count of individuals of a specific fish species | Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007). Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact |
| xfish_len | Average length of individuals of a specific fish species | Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals). Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact |
| spabun | Species abundance | Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date. | Number | Fact |
| fam | Family | Species family. | String | Dimension Attribute |
| funcfeed | Functional feeding group | Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. | String | Dimension Attribute |
| netsize | Size of fyke net | Size of fyke net. SF = small; LF = large fyke. (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available. Note: not completely populated (data missing). | String | Dimension Attribute |
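
A dictionary table like this one does not have to be pasted into the prompt by hand. As a rough sketch, it could be rendered as a Markdown table and prefixed with the instruction text. The function names `to_markdown_table` and `build_prompt` are hypothetical, only two of the eighteen rows are included to keep the example short, and sending the prompt to a model is left out entirely.

```python
HEADERS = ["Column Name", "Meaningful Name", "Description",
           "Data Type", "Data Usage Type"]

# Two rows copied from the table above; the rest are omitted for brevity.
ROWS = [
    ("site", "Site ID", "Unique 3-4 digit site ID code.", "String", "Attribute"),
    ("len", "Length", "Total length of individual fish (mm).", "Number", "Fact"),
]


def to_markdown_table(rows) -> str:
    """Render dictionary rows as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(HEADERS) + " |",
        "| " + " | ".join("-" * len(h) for h in HEADERS) + " |",
    ]
    for row in rows:
        # Put the column name in backticks and escape any pipe in the other cells.
        cells = [f"`{row[0]}`"] + [str(c).replace("|", "\\|") for c in row[1:]]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


def build_prompt(rows, n_samples: int = 5) -> str:
    """Combine the instruction sentence with the rendered dictionary table."""
    instruction = (f"Can you please generate {n_samples} sample data "
                   "using the following data dictionary table,")
    return instruction + "\n\n" + to_markdown_table(rows)


prompt = build_prompt(ROWS)
print(prompt)
```

Generating the prompt from the same source as the published dictionary keeps the two from drifting apart between iterations.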

Prompt:

Can you please generate 5 sample data using the following data dictionary table,

| Column Name | Meaningful Name | Description | Data Type | Data Usage Type |
| ----------- | --------------- | ----------- | --------- | --------------- |
| `site` | Site ID | Unique 3-4 digit site ID code. | String | Attribute |
| `name` | Site name | Wetland site name. | String | Dimension Attribute |
| `subreg` | Sub-region site name | Assigned sub-region for a given wetland site | String | Dimension Attribute |
| `lake` | Name of lake | Lake a given wetland site is connected to. | String | Dimension Attribute |
| `date` | Date | Date of data collection (dd/mm/yyyy). | Date | Attribute |
| `yr` | Year | Year of data collection. Additional years included that were not used in final study. | Number | Attribute |
| `lat` | Latitude | General site latitude coordinate (decimal degrees). | Number | Attribute |
| `long` | Longitude | General site longitude coordinate (decimal degrees). | Number | Attribute |
| `len` | Length | Total length of individual fish (mm). | Number | Fact |
| `sp`  | Species | Unique 4 digit species code. All species included, as well as those that could not be identified past genera (i.e. code ends in SP). | String | Dimension Foreign Key |
| `sciname` | Scientific name | Scientific name of fish species. | String | Dimension Attribute |
| `comname` | Common name | Common name of fish species. | String | Dimension Attribute |
| `xfish` | Count of individuals of a specific fish species | Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007). Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact |
| `xfish_len` | Average length of individuals of a specific fish species | Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals).Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact |
| `spabun` | Species abundance | Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date. | Number | Fact |
| `fam` | Family | Species family. | String | Dimension Attribute |
| `funcfeed` | Functional feeding group | Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. | String | Dimension Attribute |
| `netsize` | Size of fyke net | Size of fyke net. SF = small; LF = large fyke. (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available. Note: not completely populated (data missing). | String | Dimension Attribute |

Response:

Here are five sample data entries based on the data dictionary table:  

| site | name           | subreg | lake         | date       | yr   | lat     | long    | len | sp   | sciname              | comname          | xfish | xfish_len | spabun | fam            | funcfeed    | netsize |
|------|----------------|--------|--------------|------------|------|---------|---------|-----|------|---------------------|-----------------|-------|-----------|--------|----------------|-------------|---------|
| 1012 | Green Marsh    | NW     | Lake Huron   | 12/06/2023 | 2023 | 45.1234 | -81.4567| 285 | 0305 | *Micropterus dolomieu* | Smallmouth Bass | 20    | 290       | 20     | Centrarchidae  | Piscivore   | LF      |
| 2048 | Willow Creek   | NE     | Lake Ontario | 20/07/2023 | 2023 | 44.5678 | -79.1234| 150 | 0418 | *Perca flavescens*   | Yellow Perch    | 18    | 155       | 18     | Percidae       | Invertivore | SF      |
| 3092 | Cedar Wetland  | SE     | Lake Erie    | 01/08/2023 | 2023 | 42.2345 | -80.5678| 400 | 0150 | *Esox lucius*        | Northern Pike   | 12    | 410       | 12     | Esocidae       | Piscivore   | LF      |
| 4105 | Blue Bay       | SW     | Lake Superior| 15/09/2023 | 2023 | 47.3456 | -87.2345| 120 | 0254 | *Lepomis gibbosus*   | Pumpkinseed     | 25    | 125       | 25     | Centrarchidae  | Invertivore | SF      |
| 5021 | Meadow Lagoon  | NW     | Lake Michigan| 05/06/2023 | 2023 | 44.7890 | -85.6789| 600 | 0112 | *Amia calva*         | Bowfin          | 8     | 590       | 8      | Amiidae        | Piscivore   | LF      |

---

### Explanation:
- **site, name, subreg, lake, date, yr**: Provide site-specific details and the date/year of data collection.  
- **lat, long**: Latitude and longitude in decimal degrees format.  
- **len**: Total length of individual fish in millimeters.  
- **sp, sciname, comname**: Unique species code, scientific name, and common name of the fish species.  
- **xfish, xfish_len, spabun**: Number and mean length of extra individuals and species abundance count.  
- **fam**: Family name of the species.  
- **funcfeed**: Functional feeding group based on diet and age (Piscivore, Invertivore).  
- **netsize**: Size of fyke net used (LF = large fyke, SF = small fyke).  

Let me know if you want this in CSV format or adjusted further!

The new data sample does not show significant improvement over the previous one. The discrepancies found in the previous generation still persist.

What if we add the final field, “Sample Values,” and update the descriptions of some variables? In this latest data dictionary, we corrected the descriptions for site and sp, stated the decimal precision of lat and long, noted the larval fish special case for sciname, and provided value ranges for the quantitative data fields.

| Column Name | Meaningful Name | Description | Data Type | Data Usage Type | Sample Values |
| ----------- | --------------- | ----------- | --------- | --------------- | ------------- |
| site | Site ID | Unique 2-5 alphanumeric site ID code of abbreviated site name. | String | Attribute | e.g., ‘MER’, ‘HG’, ‘STS’, ‘QI’, ‘LY1’, ‘OJ’, ‘NB’, ‘GY’, ‘LSP’, ‘RS’, ‘TD2’, ‘TD1’, ‘BLR’, ‘DV’, ‘MNC’, ‘SHR’, ‘HRM’, ‘GC’, ‘HB2’, ‘HB3’, ‘HB1’, ‘LACin’, ‘DF’, ‘IQ’, ‘DH’ |
| name | Site name | Wetland site name. | String | Dimension Attribute | e.g., ‘Mermaid Island’, ‘Hog Bay’, ‘Sturgeon Bay South’, ‘Quarry Island’, ‘Lily Pond 1’, ‘Ojibway Bay’, ‘North Bay’, ‘Ganyon Bay’, ‘Lake St. Patrick’, ‘Roseborough’, ‘Tadenac Bay 2’, ‘Tadenac Bay 1’, ‘Black Rock’, ‘David’s Bay’, ‘Miner’s Creek’, ‘Shawanaga River’, ‘Hermann’s Bay’, ‘Garden Channel’, ‘Hay Bay 2’, ‘Hay Bay 3’, ‘Hay Bay 1’, ‘La Cloche (inner)’, ‘Dogfish Bay’, ‘Iroquois Bay’, ‘Dead Horse Bay’ |
| subreg | Sub-region site name | Assigned sub-region for a given wetland site | String | Dimension Attribute | e.g., ‘Honey Harbour’, ‘Severn Sound’, ‘Beausoleil Island’, ‘North Bay’, ‘Cognashene’, ‘Go Home Bay’, ‘Tadenac Bay’, ‘Wah Wah Tay See’, ‘Pointe au Baril’, ‘Twelve Mile Bay’, ‘Sans Souci’, ‘Fathom Five’, ‘Franklin Island’, ‘Naiscoot’, ‘Key River’, ‘Killarney’, ‘French River’, ‘Bay of Islands’, ‘McGregor Bay’ |
| lake | Name of lake | Lake a given wetland site is connected to. | String | Dimension Attribute | e.g., ‘Georgian Bay’, ‘Huron’ |
| date | Date | Date of data collection (dd/mm/yyyy), ranging from 10/07/2003 to 19/07/2021. | Date | Attribute | e.g., ‘19/07/2021’, ‘08/07/2015’, ‘08/06/2003’, ‘27/05/2004’ |
| yr | Year | Year of data collection, ranging from 2003 to 2021. Additional years included that were not used in final study. | Number | Attribute | e.g., 2021, 2015, 2003, 2004 |
| lat | Latitude | General site latitude coordinate (decimal degrees to the 5 decimal places). | Number | Attribute | e.g., 44.87908, 44.73134, 44.73148 |
| long | Longitude | General site longitude coordinate (decimal degrees to the 5 decimal places). | Number | Attribute | e.g., -79.81743, -79.80286, -79.80449 |
| len | Length | Total length of individual fish (mm), ranging from 4mm to 1180mm. | Number | Fact | e.g., null, 115, 76, 98 |
| sp | Species | Unique 4 alphabetic species code of abbreviated scientific species name. All species included, as well as those that could not be identified past genera (i.e. code ends in SP). | String | Dimension Foreign Key | e.g., ‘LEGI’, ‘LEMA’, ‘ICPU’, ‘COBA’, ‘LARV’ |
| sciname | Scientific name | Scientific name of fish species. This name is null if sp is ‘LARV’ or/and comname is ‘Larval fish’ | String | Dimension Attribute | e.g., ‘Lepomis gibbonus’, ‘Lepomis macrochirus’, ‘Icalurus punctatus’, ‘Cottus bairdi’, ‘null’ |
| comname | Common name | Common name of fish species. | String | Dimension Attribute | e.g., ‘pumpkinseed’, ‘bluegill’, ‘channel catfish’, ‘mottled sculpin’, ‘Larval fish’ |
| xfish | Count of individuals of a specific fish species | Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007), ranging from 0 to 801. Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact | e.g., null, 21, 60, 45 |
| xfish_len | Average length of individuals of a specific fish species | Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals), ranging from 18mm to 289.875mm. Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact | e.g., null, 78.13, 103.89, 80, 88, 107.13 |
| spabun | Species abundance | Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date, ranging from 1 to 801. | Number | Fact | e.g., 74, 141, 2, 7, 4, 1 |
| fam | Family | Species family. | String | Dimension Attribute | e.g., ‘Centrarchidae’, ‘Amiidae’, ‘Percidae’, ‘Gobiidae’, ‘Ictaluridae’, ‘Cyprinidae’, ‘Lepisosteidae’, ‘Gasterosteidae’, ‘Cyprinodontidae’, ‘Esocidae’, ‘Clupeidae’, ‘Catostomidae’, ‘Atherinidae’, ‘Umbridae’, ‘Sciaenidae’, ‘Cottidae’ |
| funcfeed | Functional feeding group | Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. | String | Dimension Attribute | e.g., ‘Carnivore’, ‘Piscivore’, ‘Benthivore’, ‘Omnivore’, ‘Planktivore’, ‘null’ |
| netsize | Size of fyke net | Size of fyke net, e.g. small (SF), large (LF). (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available. Note: not completely populated (data missing). | String | Dimension Attribute | e.g., ‘null’, ‘LF’, ‘SF’, ‘LF MNR’, ‘SF MNR’ |

Prompt:

Can you please generate 10 sample data using the following data dictionary table,

| Column Name | Meaningful Name | Description | Data Type | Data Usage Type | Sample Values |
| ----------- | --------------- | ----------- | --------- | --------------- | ------------- |
| `site` | Site ID | Unique 2-5 alphanumeric site ID code of abbreviated site name. | String | Attribute | e.g., 'MER', 'HG', 'STS', 'QI', 'LY1', 'OJ', 'NB', 'GY', 'LSP', 'RS', 'TD2', 'TD1', 'BLR', 'DV', 'MNC', 'SHR', 'HRM', 'GC', 'HB2', 'HB3', 'HB1', 'LACin', 'DF', 'IQ', 'DH' |
| `name` | Site name | Wetland site name. | String | Dimension Attribute | e.g., 'Mermaid Island', 'Hog Bay', 'Sturgeon Bay South', 'Quarry Island', 'Lily Pond 1', 'Ojibway Bay', 'North Bay', 'Ganyon Bay', 'Lake St. Patrick', 'Roseborough', 'Tadenac Bay 2', 'Tadenac Bay 1', 'Black Rock', 'David's Bay', 'Miner's Creek', 'Shawanaga River', 'Hermann's Bay', 'Garden Channel', 'Hay Bay 2', 'Hay Bay 3', 'Hay Bay 1', 'La Cloche (inner)', 'Dogfish Bay', 'Iroquois Bay', 'Dead Horse Bay' |
| `subreg` | Sub-region site name | Assigned sub-region for a given wetland site | String | Dimension Attribute | e.g., 'Honey Harbour', 'Severn Sound', 'Beausoleil Island', 'North Bay', 'Cognashene', 'Go Home Bay', 'Tadenac Bay', 'Wah Wah Tay See', 'Pointe au Baril', 'Twelve Mile Bay', 'Sans Souci', 'Fathom Five', 'Franklin Island', 'Naiscoot', 'Key River', 'Killarney', 'French River', 'Bay of Islands', 'McGregor Bay' |
| `lake` | Name of lake | Lake a given wetland site is connected to. | String | Dimension Attribute | e.g., 'Georgian Bay', 'Huron' |
| `date` | Date | Date of data collection (dd/mm/yyyy), ranging from 10/07/2003 to 19/07/2021. | Date | Attribute | e.g., '19/07/2021', '08/07/2015', '08/06/2003', '27/05/2004' |
| `yr` | Year | Year of data collection, ranging from 2003 to 2021. Additional years included that were not used in final study. | Number | Attribute | e.g., 2021, 2015, 2003, 2004 |
| `lat` | Latitude | General site latitude coordinate (to the 5 decimal places). | Number | Attribute | e.g., 44.87908, 44.73134, 44.73148 |
| `long` | Longitude | General site longitude coordinate (to the 5 decimal places). | Number | Attribute | e.g., -79.81743, -79.80286, -79.80449 |
| `len` | Length | Total length of individual fish (mm), ranging from 4mm to 1180mm. | Number | Fact | e.g., null, 115, 76, 98 |
| `sp`  | Species | Unique 4 alphabetic species code of abbreviated scientific species name. All species included, as well as those that could not be identified past genera (i.e. code ends in SP). | String | Dimension Foreign Key | e.g., 'LEGI', 'LEMA', 'ICPU', 'COBA', 'LARV' |
| `sciname` | Scientific name | Scientific name of fish species. This name is null if `sp` is 'LARV' or/and `comname` is 'Larval fish' | String | Dimension Attribute | e.g., 'Lepomis gibbonus', 'Lepomis macrochirus', 'Icalurus punctatus', 'Cottus bairdi', 'null' |
| `comname` | Common name | Common name of fish species. | String | Dimension Attribute | e.g., 'pumpkinseed', 'bluegill', 'channel catfish', 'mottled sculpin', 'Larval fish' |
| `xfish` | Count of individuals of a specific fish species | Total count of individuals of a specific species over 15 individuals (see Seilheimer and Chow-Fraser, 2007), ranging from 0 to 801. Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact | e.g., null,  21,  60,  45 |
| `xfish_len` | Average length of individuals of a specific fish species | Mean total length of extra fish of a specific species (calculated from the len field of the previous 15 measured individuals), ranging from 18mm to 289.875mm. Value is for individuals caught in the fyke nets of a single site for that specific date. | Number | Fact | e.g., null, 78.13, 103.89, 80, 88, 107.13 |
| `spabun` | Species abundance | Value is for # of individuals of a specific species caught in the fyke nets of a single site for that specific date, ranging from 1 to 801. | Number | Fact | e.g., 74, 141, 2, 7, 4, 1 |
| `fam` | Family | Species family. | String | Dimension Attribute | e.g., 'Centrarchidae', 'Amiidae', 'Percidae', 'Gobiidae', 'Ictaluridae', 'Cyprinidae', 'Lepisosteidae', 'Gasterosteidae', 'Cyprinodontidae', 'Esocidae', 'Clupeidae', 'Catostomidae', 'Atherinidae', 'Umbridae', 'Sciaenidae', 'Cottidae' |
| `funcfeed` | Functional feeding group | Individuals assigned functional feeding group, based on length to approximate age and therefore primary diet. | String | Dimension Attribute | e.g., 'Carnivore', 'Piscivore', 'Benthivore', 'Omnivore', 'Planktivore', 'null' |
| `netsize` | Size of fyke net | Size of fyke net, e.g. small (SF), large (LF). (see Montocchio and Chow-Fraser, 2025). Blank cells = no data available. Note: not completely populated (data missing). | String | Dimension Attribute | e.g., 'null', 'LF', 'SF', 'LF MNR', 'SF MNR' |

Response:

Here are 10 sample data entries based on your provided data dictionary table:

| site  | name              | subreg         | lake         | date       | yr   | lat      | long      | len  | sp   | sciname              | comname         | xfish | xfish_len | spabun | fam            | funcfeed   | netsize |
|-------|-------------------|----------------|--------------|------------|------|----------|-----------|------|------|---------------------|----------------|-------|-----------|--------|----------------|------------|---------|
| MER   | Mermaid Island    | Honey Harbour  | Georgian Bay | 12/07/2021 | 2021 | 44.87908 | -79.81743 | 115  | LEGI | *Lepomis gibbonus*  | pumpkinseed    | 21    | 103.89    | 74     | Centrarchidae  | Carnivore  | LF      |
| HG    | Hog Bay           | Severn Sound   | Georgian Bay | 08/06/2003 | 2003 | 44.73134 | -79.80286 | 76   | LEMA | *Lepomis macrochirus* | bluegill      | 60    | 78.13     | 141    | Centrarchidae  | Omnivore   | SF      |
| STS   | Sturgeon Bay South| Beausoleil Island | Georgian Bay | 27/05/2004 | 2004 | 44.73148 | -79.80449 | 98   | ICPU | *Icalurus punctatus* | channel catfish | 45    | 107.13    | 2      | Ictaluridae   | Benthivore | LF      |
| QI    | Quarry Island     | North Bay      | Huron        | 10/07/2010 | 2010 | 44.65023 | -80.12345 | null | LARV | null                | Larval fish    | null  | null      | 7      | Cyprinidae    | null       | SF MNR  |
| LY1   | Lily Pond 1       | Cognashene     | Georgian Bay | 15/07/2015 | 2015 | 44.56789 | -79.56789 | 120  | COBA | *Cottus bairdi*     | mottled sculpin | 18    | 80        | 4      | Cottidae      | Carnivore  | SF      |
| OJ    | Ojibway Bay       | Go Home Bay    | Georgian Bay | 05/06/2005 | 2005 | 44.89101 | -79.34567 | 289  | LEGI | *Lepomis gibbonus*  | pumpkinseed    | 801   | 289.875   | 801    | Centrarchidae  | Planktivore| LF MNR  |
| NB    | North Bay         | North Bay      | Huron        | 18/07/2019 | 2019 | 44.90876 | -80.34567 | 140  | LEMA | *Lepomis macrochirus* | bluegill      | 25    | 88        | 25     | Centrarchidae  | Omnivore   | null    |
| GY    | Ganyon Bay        | Tadenac Bay    | Georgian Bay | 20/07/2007 | 2007 | 44.12345 | -80.23456 | 78   | ICPU | *Icalurus punctatus* | channel catfish | 19    | 107.13    | 19     | Ictaluridae   | Benthivore | LF      |
| LSP   | Lake St. Patrick  | Bay of Islands | Huron        | 30/06/2008 | 2008 | 44.56780 | -80.12378 | null | LARV | null                | Larval fish    | null  | null      | 1      | Cyprinidae    | null       | SF      |
| RS    | Roseborough       | French River   | Georgian Bay | 11/07/2020 | 2020 | 44.90123 | -80.78901 | 190  | COBA | *Cottus bairdi*     | mottled sculpin | 35    | 103.89    | 35     | Cottidae      | Piscivore  | LF      |

---

### Explanation:
- **site, name, subreg, lake, date, yr**: Unique identifiers and site details with sample values based on the data dictionary.  
- **lat, long**: Sample latitude and longitude values up to 5 decimal places.  
- **len**: Fish length is either an integer between 4mm and 1180mm or null for missing data.  
- **sp, sciname, comname**: Species code, scientific name, and common name of fish. Some entries have `null` for `sciname` and `Larval fish` as the `comname`.  
- **xfish, xfish_len, spabun**: Count, average length, and abundance of fish, with null values for some fields.  
- **fam, funcfeed**: Family and functional feeding group with realistic distributions (including `null` for missing data).  
- **netsize**: Different sizes of fyke nets, including blanks (`null`).  

Let me know if you want this in CSV format!

With the updated descriptions and the addition of the “Sample Values” column, the sample data generated by the model closely resembles data from the original dataset. The model’s outputs reflect how a human user would interpret the data using the data dictionary. This demonstrates that publicly available generative artificial intelligence tools built on large language models can be valuable for evaluating the quality of research data documentation.
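The comparison between generated rows and the dictionary can also be partly automated. The following is a minimal sketch, not part of the original workflow: it assumes the allowed values for `funcfeed` and `netsize` and the stated range for `spabun` have been copied by hand from the dictionary, and the helper name `check_row` and the third, deliberately invalid row are hypothetical.

```python
# Constraints transcribed by hand from the data dictionary (assumption:
# the listed sample values are treated here as the full set of allowed values).
ALLOWED = {
    "funcfeed": {"Carnivore", "Piscivore", "Benthivore", "Omnivore", "Planktivore", "null"},
    "netsize": {"null", "LF", "SF", "LF MNR", "SF MNR"},
}
SPABUN_RANGE = (1, 801)  # "ranging from 1 to 801" in the spabun description


def check_row(row):
    """Return a list of problems found in one generated row (empty if none)."""
    problems = []
    for column, allowed in ALLOWED.items():
        if row[column] not in allowed:
            problems.append(f"{column}: unexpected value {row[column]!r}")
    low, high = SPABUN_RANGE
    if not low <= row["spabun"] <= high:
        problems.append(f"spabun: {row['spabun']} outside {low}-{high}")
    return problems


sample_rows = [
    # Two rows taken from the model response above (selected columns only).
    {"site": "MER", "spabun": 74, "funcfeed": "Carnivore", "netsize": "LF"},
    {"site": "QI", "spabun": 7, "funcfeed": "null", "netsize": "SF MNR"},
    # Hypothetical invalid row, included only to show what a failure looks like.
    {"site": "XX", "spabun": 0, "funcfeed": "Herbivore", "netsize": "LF"},
]

for row in sample_rows:
    print(row["site"], check_row(row) or "ok")
```

A check like this only covers what the dictionary states explicitly, so it complements a manual read-through of the generated rows and does not replace it.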

Additionally, one could ask these generative artificial intelligence tools directly for suggestions to improve the dictionary, or ask the model to write database queries (e.g., SQL statements) based on its understanding of the data’s structure and organization as described in the dictionary.
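As an illustration of the kind of query a model might produce from the dictionary, here is a small sketch using Python's built-in `sqlite3` module. The table name `fish`, the choice of columns, and the query itself are assumptions made for this example. The four rows are taken from the model response above.

```python
import sqlite3

# In-memory database with a few columns named as in the data dictionary.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fish (site TEXT, comname TEXT, funcfeed TEXT, netsize TEXT)")

rows = [
    ("MER", "pumpkinseed", "Carnivore", "LF"),
    ("HG", "bluegill", "Omnivore", "SF"),
    ("STS", "channel catfish", "Benthivore", "LF"),
    ("LY1", "mottled sculpin", "Carnivore", "SF"),
]
conn.executemany("INSERT INTO fish VALUES (?, ?, ?, ?)", rows)

# Count records per functional feeding group, most frequent first.
query = """
SELECT funcfeed, COUNT(*) AS n
FROM fish
GROUP BY funcfeed
ORDER BY n DESC, funcfeed ASC
"""
counts = conn.execute(query).fetchall()
for funcfeed, n in counts:
    print(funcfeed, n)

conn.close()
```

If a model-written query uses column names or value codes that do not exist in the dataset, that mismatch is itself a hint that the dictionary entry for that column may be unclear.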