Forum - XML Standard

  • 1.  Sample XML Files

    Posted 05-06-2021 02:46 PM
    Kathleen Thoburn

    The sample XML files posted to the NAACCR XML website are quite rudimentary, having only a PatientID and primary site. Only v16 is posted. Is there any way that a full v18 sample file with fake data can be posted to the website?

    Fabian Depry

    Hello Kathleen,

    The samples will be updated to NAACCR 18, but like you said, they are rudimentary; their purpose is to provide a set of "valid" vs "invalid" files that can be used to test a given piece of software.

    There are tools that can create more complete "fake" data files. I know the SEER Data Viewer (https://seer.cancer.gov/tools/dataviewer/) does that, and I think there might be others out there; maybe someone else will comment on this. As far as I know, though, those tools only set a small subset of values.

    I think a "full" sample file (meaning all variables filled in) would probably need to be crafted by hand. I will bring this topic to the NAACCR XML workgroup during our next meeting.
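
    For anyone who wants a slightly larger fake file in the meantime, here is a minimal sketch of generating one in Python. It is not an official tool. The namespace URI, the root attributes, the dictionary URI, and the item IDs (patientIdNumber, primarySite) are assumptions based on the general shape of NAACCR XML files and should be checked against the specification and the v18 dictionary. The function name and the list of site codes are made up for illustration.

```python
import random
import xml.etree.ElementTree as ET

# Assumed namespace for NAACCR XML; verify against the specification.
NS = "http://naaccr.org/naaccrxml"

# Illustrative primary site codes only; not a realistic distribution.
FAKE_SITES = ["C340", "C509", "C619"]


def build_fake_file(num_patients, seed=0):
    """Return a small NAACCR-XML-like document with fake values (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    ET.register_namespace("", NS)  # serialize with a default namespace, no prefix

    # Root attributes are assumptions; adjust to the version you target.
    root = ET.Element(f"{{{NS}}}NaaccrData", {
        "baseDictionaryUri": "http://naaccr.org/naaccrxml/naaccr-dictionary-180.xml",
        "recordType": "I",
        "specificationVersion": "1.3",
    })

    for i in range(1, num_patients + 1):
        patient = ET.SubElement(root, f"{{{NS}}}Patient")
        patient_id = ET.SubElement(patient, f"{{{NS}}}Item", {"naaccrId": "patientIdNumber"})
        patient_id.text = f"{i:08d}"  # zero-padded sequential fake ID

        tumor = ET.SubElement(patient, f"{{{NS}}}Tumor")
        site = ET.SubElement(tumor, f"{{{NS}}}Item", {"naaccrId": "primarySite"})
        site.text = rng.choice(FAKE_SITES)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_fake_file(2))
```

    This only fills in the same two fields as the posted samples; extending it to more variables means adding more Item elements with IDs taken from the dictionary, and the result should still be validated with whatever software you are testing.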

    Isaac Hands

    Kathleen, I assume you are talking about the sample files posted here:
    https://github.com/imsweb/naaccr-xml/wiki/2:-Sample-Data-Files

    Those are definitely "rudimentary" and we will update them to v18, but that may not be what you are looking for. If you want meaningful v18 XML data right now, you can convert an existing fixed-width v18 file into an XML file very easily using the NAACCR-XML Utility tool:
    https://github.com/imsweb/naaccr-xml/wiki/1:-NAACCR-XML-Utility-Tool

    Also, there is a NAACCR group that works on creating synthetic data and they created some nice synthetic XML data sets for the NAACCR Hackathon last year. I know that those particular datasets were bound by data-use agreements since they were based on real submission files – so we will explore having those released to a wider audience.

    Isaac Hands

    I did some investigation, and anyone who wants to obtain a synthetic v18 XML dataset for evaluation or training, containing about 500,000 records, can contact Recinda Sherman at NAACCR (rsherman at naaccr.org) with the following information in the email:

    1. Explanation of what the data will be used for and whether you want record type I or C – the data cannot be shared outside of your stated use.

    2. A filled-out "Data Confidentiality Agreement for Researchers" document from this URL: https://www.naaccr.org/irb-information-for-cina/#IRBFORMS

    This synthetic dataset is not just junk data. Recinda can explain the characteristics best, but the data values are meaningful with respect to the distribution of values from actual cancer datasets and have been appropriately anonymized.