As a vendor, we are making significant progress toward full support of NAACCR XML. We are currently converting the SEER Abstracting Tool (SEER*Abs) so that it creates NAACCR XML. SEER*DMS imports and exports NAACCR XML. We are evaluating all processes related to SEER and NAACCR submissions. We are dedicated to this project. SAS is one of the biggest challenges that remain. Can the workgroup provide guidance or support in developing a plug-in, library, or whatever would make it easier to import NAACCR XML in SAS?
Thank you for bringing this to our attention; it has prompted a lot of good discussion within the workgroup.
In the short term, you can convert a NAACCR XML file into a fixed-width file using the software here: https://github.com/imsweb/naaccr-xml or here: https://www.cdc.gov/cancer/npcr/tools/registryplus/xml-exchange-plus.htm
But that short-term solution will not help after the 2020 XML transition, when the fixed-width format is no longer maintained, so it will be important to have a longer-term plan.
Any SAS developer can use the SAS XML Mapper to develop their own custom NAACCR XML parsing logic: https://support.sas.com/rnd/base/xmlengine/index.html
However, as a NAACCR Workgroup, we would like to provide a great solution for all NAACCR SAS users instead of asking everyone to develop their own. Currently, we do not have any SAS users in the NAACCR XML Workgroup, so if you know any who would like to participate, please tell them to get in touch with me (email@example.com).
Until we get some SAS expertise in our workgroup, we were wondering if one of the following options would work:
1. I know that R and SAS are very similar, and since SAS is very expensive and therefore more difficult to develop for when you do not already have a license, if the NAACCR XML Workgroup provided a solution for processing NAACCR XML in R, would that help with a SAS solution?
2. Given that SAS can connect to relational database servers easily, if the NAACCR XML Workgroup provided a standardized way to import NAACCR XML files into one or more databases, would that be a good solution for SAS?
We recognize that SAS is a critical software tool for many NAACCR community members and want to provide an easy transition to XML as much as possible. We welcome additional feedback and discussion on this issue.
I will let Linda comment on your proposed solution.
But just so you know, we (at IMS) have a few SAS experts and we are currently investigating the SAS XML Mapper (it seems to be the standard solution for that type of problem).
Our first proof of concept went very well, but we are not ready to post any results yet. Once we are, we will report them to the NAACCR XML work group and update this forum.
I think we are done looking at SAS for now.
For reading SAS, it looks like an acceptable solution will involve an XMLMap file that tells SAS how to construct the data sets based on the different levels of data in the XML (one data set for NaaccrData, one for Patient and one for Tumor, although all of those sets can be defined in a single XMLMap file). The XMLMap will define SAS variables based on their NAACCR ID attribute (using XPath); it will also define a few "ORDINAL" variables, which will be used as identifiers for every row of the data sets (they are called ORDINAL because they are counters incremented when a specific tag is found in the data file). A SAS program will then be able to "merge" the different data sets back together using the ORDINAL variables as a pivot (or linkage variable); the end result will be a single data set where the NaaccrData data is repeated for every Patient and Tumor, and the Patient data is repeated for every Tumor (which is the same behavior as reading flat files).
There is one caveat to this solution: SAS will read and process every variable defined in the XMLMap, so using a mapping file that defines all variables won't be practical for large data files (the processing will be too slow). Instead, a smaller mapping file should be used with just the variables that are needed for the program. Hopefully it will be possible to create those specialized XMLMap files using open-source software.
I am attaching an example of a mapping file including only a few variables:
- naaccr-xml-v16-data-sample.xml: a very simple NAACCR XML sample file
- naaccr-xml-v16-sas-def-minimal.map: an XMLMap file containing the definition of one variable at each XML level (plus the ordinal variables)
- readin.level2.sas: a simple SAS program that merges Patient and Tumor data from the sample files and prints frequencies of the defined variables
- readin.level2.output: the results of running the SAS program (I only copied the relevant frequencies)
For writing SAS, the conclusion would be "don't do it". We found no satisfactory way of using an XMLMap to re-create a valid NAACCR XML file. There are other solutions that don't use an XMLMap but they are very involved and require some type of coding that most people wouldn't be willing to do. There are other tools and software that can recode variables and that will probably be updated to support NAACCR XML; the best approach for recoding XML files would be to switch to those tools.
*** Update: it looks like I can't upload the files in this post; all the files have been uploaded to the Java NAACCR XML project on GitHub: https://github.com/imsweb/naaccr-xml/tree/master/docs/sas
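The merge described for the reading approach can be illustrated outside SAS. The following is a minimal Python sketch, not the SAS program posted on GitHub: it walks the three levels of a small made-up NAACCR XML document, keeps two counters that play the role of the ORDINAL variables, and produces one row per Tumor with the NaaccrData and Patient items repeated. The sample data, the function names and the counter names (patientOrdinal, tumorOrdinal) are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; real files carry many more items per level.
SAMPLE = """<NaaccrData xmlns="http://naaccr.org/naaccrxml" recordType="I">
  <Item naaccrId="registryId">0000000001</Item>
  <Patient>
    <Item naaccrId="patientIdNumber">00000001</Item>
    <Tumor><Item naaccrId="primarySite">C509</Item></Tumor>
    <Tumor><Item naaccrId="primarySite">C619</Item></Tumor>
  </Patient>
  <Patient>
    <Item naaccrId="patientIdNumber">00000002</Item>
    <Tumor><Item naaccrId="primarySite">C189</Item></Tumor>
  </Patient>
</NaaccrData>"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def item_values(element):
    """Collect the direct <Item> children of an element as {naaccrId: value}."""
    return {
        child.get("naaccrId"): (child.text or "")
        for child in element
        if local(child.tag) == "Item"
    }


def flatten(xml_text):
    """Return one dict per Tumor, repeating the root and Patient items."""
    root = ET.fromstring(xml_text)
    root_items = item_values(root)
    rows = []
    patient_ordinal = 0  # incremented for every <Patient> tag
    tumor_ordinal = 0    # incremented for every <Tumor> tag
    for patient in root:
        if local(patient.tag) != "Patient":
            continue
        patient_ordinal += 1
        patient_items = item_values(patient)
        for tumor in patient:
            if local(tumor.tag) != "Tumor":
                continue
            tumor_ordinal += 1
            row = {"patientOrdinal": patient_ordinal, "tumorOrdinal": tumor_ordinal}
            row.update(root_items)
            row.update(patient_items)
            row.update(item_values(tumor))
            rows.append(row)
    return rows
```

Calling flatten(SAMPLE) yields three rows, one per Tumor; the two tumors of the first patient share the same patientOrdinal and the same patientIdNumber, which mirrors the flat-file behavior described above.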
This sounds like a great way to do it, thank you for posting the files and explanation. I wonder what is the best way to get other NAACCR Community SAS users to try it out and give feedback.
I work in informatics at the Utah Cancer Registry and use SAS frequently so I am very interested in this topic. I will try out the code soon.
We read NAACCR files much more often than we write them, and we very rarely read all the variables in SAS. I think it could be reasonable to create a full mapping file for the community and recommend that they only keep the relevant variables; I can just delete variables from the .map file Fabian provided using Notepad++.
Even though writing from SAS is problematic, so much of our work and analysis goes through SAS that I don't see us moving away from it for a long time. We would likely want a different tool to convert SAS output to NAACCR XML (like SEER Data Viewer? or something in Python, if its XML packages can make a valid file).
Valerie,
Please post your results when you get a chance to try out the XML code in SAS; I know others in the NAACCR Community will be interested in your experience.
You mentioned using a different tool to convert SAS output to NAACCR XML, possibly Python, and I noticed that there is an "official" way to access SAS datasets from Python with this library: https://github.com/sassoftware/saspy
If someone created a Python script to create NAACCR XML from a SAS dataset, would that be something you would be interested in?
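To give a rough idea of what such a script might involve, here is a minimal Python sketch that nests flat rows (one per tumor) back into Patient and Tumor elements using only the standard library. It is an illustration under assumptions, not an existing tool: the function names and sample rows are made up, the step of reading a SAS dataset is not shown, and the output deliberately omits the namespace and root attributes that a real NAACCR XML file needs, so it is not a valid NAACCR XML file as written.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Made-up flat rows, one per tumor, already sorted by patient.
ROWS = [
    {"patientIdNumber": "00000001", "primarySite": "C509", "rxTextChemo": "DrugA & DrugB"},
    {"patientIdNumber": "00000001", "primarySite": "C619", "rxTextChemo": ""},
    {"patientIdNumber": "00000002", "primarySite": "C189", "rxTextChemo": ""},
]


def add_item(parent, naaccr_id, value):
    """Append an <Item naaccrId="..."> child; blank values are skipped."""
    if value:
        item = ET.SubElement(parent, "Item", naaccrId=naaccr_id)
        item.text = value


def rows_to_xml(rows, patient_key, patient_fields, tumor_fields):
    """Nest flat rows into Patient/Tumor elements.

    Assumes all rows of a given patient are adjacent (groupby only groups
    consecutive rows with the same key).
    """
    root = ET.Element("NaaccrData")
    for _, group in groupby(rows, key=lambda row: row[patient_key]):
        patient_rows = list(group)
        patient = ET.SubElement(root, "Patient")
        # Patient-level items are taken from the first row of the group.
        for field in patient_fields:
            add_item(patient, field, patient_rows[0].get(field, ""))
        # Every row of the group becomes one Tumor under that patient.
        for row in patient_rows:
            tumor = ET.SubElement(patient, "Tumor")
            for field in tumor_fields:
                add_item(tumor, field, row.get(field, ""))
    return ET.tostring(root, encoding="unicode")
```

Because the serializer escapes characters such as & on output, free-text fields do not need to be wrapped by hand. Grouping by an explicit patient key is one way to keep each tumor under the right patient.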
Like many registries, we use SAS to pre-process all the incoming transmissions before loading them into a registry database. I need to separate RAPID reports from DEFINITIVE reports (they arrive mixed), clean up dates that are incorrect (MMDDYY, YYMMDD, YYDDMM, etc.), standardize hospital numbers (some reporters are required to send us blank hospital numbers), etc. So currently I read in all the variables, process, and write out all the variables. I've tried to figure out a way to do this with XML and have come up empty so far. And as part of the process, I generate management reports that make registry processing of cases easier. Converting each transmission (sometimes as few as 2 cases) from XML to a flat file, processing it, and then converting it back to XML could be very labor intensive.
My trials with the SAS Mapper have not been successful so far; SAS says my files are too large, and I've only used 8,000 cases.
I have also discovered, as many have said, that XML files are very large. I worry, given the charges for disk space and rapidly shrinking financial resources, whether we can afford XML. Those of us without a programming staff will need some very robust tools to make this all work.
Bruce,
Thank you for trying out the XML Mapper in SAS; we will discuss the issues you bring up in our XML Workgroup calls and post on this thread when we have some ideas of how to move forward.
The tasks you are describing are straightforward in a language that has first-class support for XML, such as Python, Java, C#, etc., but we are still trying to figure out the best way to deal with large XML files in SAS that need hundreds of variables. Writing out XML files seems to be an afterthought in SAS, so we are still trying to get to the bottom of that issue as well. As you probably know, SAS is a very expensive piece of software, and many of us on the XML Workgroup are not as familiar with it as we need to be, so it is taking longer to find solutions to these problems than with our currently published NAACCR XML software tools and libraries, but we are working on it.
-Isaac
I tried out the sample code Fabian posted on XML files I created using various tools and our data for one year. The good news is that I got identical results from the XML files exported by the tools, although they differed in size. For 8,000 cases, one file was 188,725 KB and one was 148,833 KB. The bad news is that it is very slow. SAS is provided under license to NPCR Registries and many take advantage of the opportunity. Few registries I know have any staff who know any Java, Python, or C++. If I knew C++ or Java well enough to write code to manipulate XML, I would get a much better-paying job.
One suggestion here was to use the XML tools built into MS SQL. We will explore that idea.
My experiments with SAS and XML have not been very successful. The loss of SAS eliminates a very powerful tool both for basic file processing prior to loading data into the registry database and for working with the data on export from the registry database. I have little hope that SAS will invest in a more advanced XML tool.
Here is one idea for a solution to at least create analytical files. SAS PROC IMPORT will read delimited files with a header. This provides an option for two applications.
One application would export selected variables from the main database in a pipe-delimited format with a header. To make this more user friendly, the application needs a configuration page where you can just check the variables you need and keep that list as a file for future use. Some users will only need to set the configuration once. Then SAS PROC IMPORT can read in the delimited file and create the SAS data set.
The second application would read an XML file and perform the same task as above.
In both instances, there would be one line per patient/tumor. Very few exercises require the entire set of NAACCR variables, so these analytic data sets should be fairly small.
The major advantage of this method is that you do not need any input or format statement. The significant disadvantage is that PROC IMPORT selects the input format, so sometimes you get numeric when you want character, etc.
Another version of the above is to write out two separate files: one file of pipe-delimited data and a second file containing the input format. The input format could easily be dragged into a SAS program. The configuration page could allow for selection of formats. For example, I read in all dates as character since NAACCR allows dates with blanks. In SAS, I can fill in the blanks before creating a SAS date that can be manipulated.
The XML file for a standard time period, 1995 to 2018, will be very large. Few registries will have the storage capacity to keep a reasonable number of these files around.
The ability to easily create analytic files is very important. Finding a very convenient way to unzip, run a tool or GenEdits, and re-zip will be important.
I am not ready to give up on SAS yet.
It sounds like your suggestions above are along the same lines as what I have been wondering: if SAS doesn't really support large XML data files, is there an intermediate format that SAS could use instead of the XML directly?
For example, here is a list of all "DBMS" formats that SAS can use natively in the PROC IMPORT function: https://support.sas.com/documentation/cdl/en/acpcref/63184/HTML/default/viewer.htm#a003094743.htm
Is it possible/probable/straightforward to create some sort of library that SAS can call directly (Java, Python, etc.) that will take an XML file, create one of these intermediate formats, and then load into SAS datasets so that the rest of SAS is happy?
Following up on my last post to this thread… I wonder if using a CSV-formatted file would be a good intermediary between SAS and XML. The CSV format doesn't suffer from many of the limitations of the fixed-width file, such as needing to know the position and length of all variables beforehand, so translating between XML and CSV will not require maintenance of Volume II metadata to go along with every NAACCR Item. CSV will still be limited for conveying multi-tier data, such as Patient/Tumor/etc., but SAS does not understand multi-tier data models anyway, so maybe that's OK for this use case.
I have been playing around with some Java code running inside SAS that can generate CSV from NAACCR XML and then load the data as a SAS dataset. So far, it looks promising: it takes about 4.5 minutes to load a 6GB NAACCR XML file into a SAS dataset with this method, using a pretty basic Windows 10 desktop computer. I am not sure if that will be acceptable, but it might make a nice proof of concept.
Here is what the SAS code looks like:
filename xmlfile 'C:\\Users\\isaac\\Documents\\ky9515v16.xml';
filename csvfile 'C:\\Users\\isaac\\Documents\\ky9515v16.csv';
declare JavaObj j1 ("edu/uky/kcr/naaccrxml/csv/ConvertXmlToCsv", xmlfile, csvfile);
proc import datafile=csvfile
The Java code behind this is using the Java NAACCR XML library from IMS
I think this is a good idea.
In the end, this is similar to what the NAACCR XML Utility tool does, except that it translates XML into NAACCR fixed-column format instead of CSV.
Did you use specialized code to read the XML, or did you use the existing Java library to read the data "patient by patient"?
I just used the existing Java library to read each patient as it occurred in the XML file, writing out an incremental "csvPatientId" number for each <Patient> element so that SAS would know where the unique patients were. If there is interest in this technique, I will post my Java code.
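For readers who want to see the shape of this technique without the Java library, here is a minimal Python sketch of the same idea: stream through the XML one element at a time, increment a csvPatientId counter for each Patient element, and write one CSV row per Tumor. It is not the Java code mentioned above; the function names and the sample data are assumptions, and it relies on Items appearing before the Tumor elements within each Patient.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample data for illustration.
SAMPLE = b"""<NaaccrData xmlns="http://naaccr.org/naaccrxml" recordType="I">
  <Item naaccrId="registryId">0000000001</Item>
  <Patient>
    <Item naaccrId="patientIdNumber">00000001</Item>
    <Tumor><Item naaccrId="primarySite">C509</Item></Tumor>
    <Tumor><Item naaccrId="primarySite">C619</Item></Tumor>
  </Patient>
  <Patient>
    <Item naaccrId="patientIdNumber">00000002</Item>
    <Tumor><Item naaccrId="primarySite">C189</Item></Tumor>
  </Patient>
</NaaccrData>"""


def local(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def xml_to_csv(xml_source, csv_target, fields):
    """Stream xml_source and write one CSV row per Tumor to csv_target."""
    writer = csv.DictWriter(
        csv_target, fieldnames=["csvPatientId"] + fields, extrasaction="ignore"
    )
    writer.writeheader()
    patient_id = 0
    levels = {"NaaccrData": {}, "Patient": {}, "Tumor": {}}
    stack = []  # names of the currently open elements
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        name = local(elem.tag)
        if event == "start":
            stack.append(name)
            if name == "Patient":
                patient_id += 1
                levels["Patient"] = {}
            elif name == "Tumor":
                levels["Tumor"] = {}
            continue
        stack.pop()
        if name == "Item" and stack and stack[-1] in levels:
            # Store the value under the level of the enclosing element.
            levels[stack[-1]][elem.get("naaccrId")] = elem.text or ""
        elif name == "Tumor":
            row = {"csvPatientId": patient_id}
            for level in ("NaaccrData", "Patient", "Tumor"):
                row.update(levels[level])
            writer.writerow(row)
        elif name == "Patient":
            elem.clear()  # drop the finished patient's children to limit memory use


def convert_sample(fields):
    """Convert the sample document and return the CSV lines."""
    target = io.StringIO()
    xml_to_csv(io.BytesIO(SAMPLE), target, fields)
    return target.getvalue().splitlines()
```

Only the requested fields are written, so a short variable list gives a small CSV. A patient without any Tumor element produces no row in this sketch.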
As a fun exercise, I started this experiment by creating an Access Database instead of a CSV file from the XML, mostly because SAS has "native" support for Access databases, much better than SAS XML support, and Microsoft Access has been mentioned several times as a tool that some registries use. Unfortunately, I quickly ran into limitations of the Access database format, specifically the 2GB file size and the number of fields in a table: https://support.office.com/en-us/article/Access-specifications-0cf3c66f-9cf2-4e32-9568-98c1025bb47c
From what I can tell, Access can load CSV files with some fiddling, so maybe this solution would help both Access and SAS users.
I wanted to try your Java solution, but I ran into an issue: SAS uses a private JRE that they maintain and they are way behind: their latest version (SAS 9.4) requires Java 7 (which has been end-of-life for 3 years!). The NAACCR XML Java library is compiled under Java 8, and so it's not compatible with SAS 9.4.
I got that information from this link: https://support.sas.com/en/documentation/third-party-software-reference/9-4/support-for-java.html
How did you make your example run with the Java 8 NAACCR XML library?
For SAS 9.4, this is the magic parameter you need to set in the JREOPTIONS of C:\Program Files\SASHome\SASFoundation\9.4\nls\en\sasv9.cfg:
-Dsas.jre.libjvm=C:\Program Files\Java\jdk1.8.0_161\jre\bin\server\jvm.dll
The instructions online about setting sas.jre.home are wrong; here is the complete JREOPTIONS setting from my sasv9.cfg file:
/* Options used when SAS is accessing a JVM for JNI processing */
You probably know this, but don't forget to set your environment variable CLASSPATH to point to your jar file.
I see. But this is a global setting you change on your local machine. The SAS instance I use is a company-wide instance running on a remote Linux server.
I guess I could ask our IT to change the SAS JRE globally for the company, but I am not sure they will accept that…
I still think it's an interesting solution, but I was hoping to be able to set the JRE when calling SAS (or that the default JRE would support Java 8 which has been out for 5 or 6 years now). That's a bit disappointing.
Thanks for the info though.
I put together a solution for reading using an XML Mapper and for writing using a tagset template. It seems to work fine for small data files but it doesn't scale well and those solutions are not really usable for big files.
I am currently working on a solution that involves calling a Java Archive (JAR) through SAS; the Java creates a tmp CSV file based on the XML and SAS can then easily read that CSV. The logic for calling Java is embedded in a SAS macro that can easily be distributed. This solution is still slower than dealing with flat files, but it's much more reasonable for big files than the XML Mapper and/or tagsets.
I posted all my code and experiments in the Java NAACCR XML GitHub project: https://github.com/imsweb/naaccr-xml/wiki (there is a NAACCR XML and SAS section at the bottom).
Please feel free to download those examples and try them yourself and provide feedback in this forum!
I tried the examples with Fabian's JAR and I think it's the best solution for SAS so far. It's more straightforward to a user than the XML Mapper (I did not try tagsets). No special configuration was necessary which is great. I like the ability to specify a short list of variables to read, this is easier than what we currently have to do for flat files!
I found the speed very good – it took slightly less time to read one of our annual submission files in xml than it did to read the flat file (both v16), and about twice as long to write it back out. I read abstracts in SAS far more often than I write them, so the writing being slower doesn't bother me. When I read a smaller full abstract file from a hospital (v16, converted), the speed difference was negligible.
Suggestions:
- Add the 'replace' option to the tmp csv import step.
- There are some truncation problems on import. With the submission file, I observed that this short list of variables was imported as $1. when they should be $2.-$4 (and therefore the truncated value was written out): tumorSizeSummary, tnmEditionNumber, tnmPathT, tnmPathN, tnmPathM, tnmPathStageGroup, tnmClinT, tnmClinN, tnmClinM, tnmClinStageGroup, radRegionalRxModality
When I read full abstracts, there were 65 variables that were truncated, such as addrCurrentCity imported as $14. instead of $50.
That's great, thanks for testing that solution!
I will add the replace option, that makes sense.
I assume that the truncation happens because SAS only uses a subset of the rows to determine the max length of a given column when reading CSV. If that's the issue, then I think I have a solution. I will try it soon and post new files.
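If the cause is indeed that only the first rows are inspected, one possible remedy (not necessarily the one that ended up in the JAR) is to scan the whole CSV once and record the longest value per column, so that explicit lengths can be supplied on import. SAS also documents a GUESSINGROWS statement for PROC IMPORT on delimited files that controls how many rows are scanned. The Python sketch below only illustrates the scanning idea; the function name, the limit parameter and the sample data are assumptions.

```python
import csv
import io

# Made-up data: the longest values appear after the first row.
DATA = "addrCurrentCity,tnmPathT\nSLC,1\nSALT LAKE CITY,1B\n"


def max_column_lengths(csv_file, limit=None):
    """Return {column: longest value length}, scanning at most `limit` rows.

    With limit=None every row is scanned, which avoids the truncation that
    can occur when lengths are guessed from a sample of the rows.
    """
    reader = csv.DictReader(csv_file)
    lengths = {name: 0 for name in reader.fieldnames}
    for index, row in enumerate(reader):
        if limit is not None and index >= limit:
            break
        for name in lengths:
            lengths[name] = max(lengths[name], len(row.get(name) or ""))
    return lengths


full_scan = max_column_lengths(io.StringIO(DATA))
first_row_only = max_column_lengths(io.StringIO(DATA), limit=1)
```

Here the full scan finds the longer city name and the two-character tnmPathT value, while the scan limited to the first row underestimates both columns.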
Hi Valerie, I implemented the changes we talked about; do you mind re-trying your example when you get a chance?
Note that I removed the version from the macro filenames; I figured it will be easier for people to just replace the files.
I also re-created the SAS JAR file with a fix for the length issue and re-posted it under the same name (naaccr-xml-4.9-sas.jar) on the release page of the GitHub project (eventually the version will be increased, but I figured this can still be considered the "first" version).
The truncation issues seem to be fixed! I agree the macro files don't need versions.
There's a problem reading text fields that contain CDATA; they are not imported.
Example:
<Item naaccrId="rxTextRadiation" naaccrNum="2620"><![CDATA[1/1/18 HOSPITAL – DR X. O'EXAMPLE: SOMETEXT – MORETEXT & SOMEMORETEXT]]></Item>
I will take another look at some point.
To make this work with SAS, which still requires Java 7 or earlier (which has been end-of-life for several years now), I had to implement my own (simple) parsing logic. So there are things that are not going to be properly parsed. Hopefully I can address them as they are found.
Hi Valerie, I fixed the issue you reported with the CDATA sections. I re-created the JAR file (with the same 4.9 version still) and re-posted it on GitHub. It would be great if you could confirm the fix is working as expected.
Yes, overall the fix is working for CDATA sections! Just one minor additional fix: when there are pairs of [ ] within the text, the second ] onward is consistently not read. It's cut off in the temp csv and the SAS dataset.
Example xml:
<Item naaccrId="rxTextChemo" naaccrNum="2640"><![CDATA[1/10/2016 DrugB (Part1, Part2, & Part3) @ Facility w/ Dr. Name. [DrugA started in 1/2015, but DrugB regimen planned] 1/15/2017 Drugc @ Facility w/ Dr Name2]]></Item>
The resulting variable only contains:
1/10/2016 DrugB (Part1, Part2, & Part3) @ Facility w/ Dr. Name. [DrugA started in 1/2015, but DrugB regimen planned
I think this is likely because CDATA uses [ ]; I didn't have problems with any other special characters in the data I tested.
Thanks for testing again! It's really difficult to properly cover all those corner cases! I really should use a standard Java XML parser, but none of them is still compatible with Java 7 which is required by SAS. They really need to move along and update their Java version!!!
I will take a look at this one soon.
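To make the corner case concrete, here is a small Python sketch with made-up text. It compares three ways of pulling the value out of a CDATA section: a naive search that stops at the first closing bracket (which reproduces a symptom similar to the one reported, although the actual cause inside the JAR is not known here), a hand-written search for the full "]]>" terminator, and a standard XML parser. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up item whose text contains a pair of square brackets.
ITEM = (
    '<Item naaccrId="rxTextChemo" naaccrNum="2640">'
    "<![CDATA[DrugB (Part1 & Part2) [DrugA started, but DrugB planned] later text]]>"
    "</Item>"
)
OPEN = "<![CDATA["


def cdata_naive(raw):
    """Stop at the first ']' after the opening marker (too early)."""
    start = raw.index(OPEN) + len(OPEN)
    return raw[start:raw.index("]", start)]


def cdata_by_hand(raw):
    """Stop at the first ']]>' after the opening marker, the real terminator."""
    start = raw.index(OPEN) + len(OPEN)
    return raw[start:raw.index("]]>", start)]


def cdata_by_parser(raw):
    """Let a standard XML parser resolve the CDATA section."""
    return ET.fromstring(raw).text


naive = cdata_naive(ITEM)
by_hand = cdata_by_hand(ITEM)
by_parser = cdata_by_parser(ITEM)
```

The naive version loses everything from the first closing bracket onward, while the other two return the full text, including the brackets and the ampersand.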
Hi Valerie, the issue you reported should be fixed now, and I actually released a new version of the library (version 4.10). If you happen to re-test this, please make sure to use that new version of the SAS JAR file and not the previous 4.9. Thanks!
Looks good to me, I don't see any other problems at this time!
Awesome! Thanks for testing!
Sorry to get your hopes up, reading looks good but I forgot to check writing! It seems broken in 4.9 and 4.10. The temporary output csv file contains all of the patients & tumors, but the xml only contains the first patient, their first tumor, and a tumor from another patient down to rxTextSurgery. I don't see anything obvious about why it stopped writing and why the tumor got placed under the wrong patient, it is correct in the temp CSV.
ex. <Patient><Item>Patient 1 info</Item><Tumor><Item>Patient 1's tumor info</Item></Tumor><Tumor><Item>Patient 2's tumor info</Item>
Successfully wrote:
<Item naaccrId="textDxProcPath">9-9-16 HOSPITAL PATH-16-99999 BRAIN, TEST, TEXT: TEXT, WHO GRD 999. TEXT, BRAIN TMR: X/X TUMOR. XXX-9: TEXT. TEXT: TEXT, TEXT: TEXT</Item><Item naaccrId="textStaging">N/A</Item><Item naaccrId="rxTextSurgery">9-9-16 HOSPITAL: TEXT W/TEXT TEXT BY DR DOCTOR</Item>
(end of writing)
The next item to be written, which is present in the tmp csv, is rxTextRadiation: 9-9/9-9-16 HOSPITAL, DR DOCTOR: XXX BRAIN (9999 CGY), 99 FX'S, XXXX & 9MV
It successfully wrote some variables that were read with CDATA, so I don't think that was the problem.
I looked more into the issue you described, but I can't reproduce it.
I used the following file: https://github.com/imsweb/naaccr-xml/blob/master/src/test/resources/data/sas/test2.xml
I tried to create a file that represents the data you described.
Could you please try that file yourself when you have some time, and confirm that it's also working for you. And if it is, can you please compare it with your own file and maybe try to figure out the difference?
I want to make the case in writing that a different approach is needed to get data out of a central registry database into analytical tools like SAS, GenEdits, InterRecordEdits, SEER*PATH, Match*PRO, etc.
The NAACCR Volume 2 has the title 'Data Standards and Data Dictionary.' Then the next piece is called the 'XML Data Exchange Standard.' The primary goal of the data exchange standard is to ensure seamless transmission between registries, be it a hospital registry or a central registry. Nowhere is it written that the XML Data Exchange Record has to be read by any of the analytical tools. Almost all of our analytical tools do not read or cannot read XML documents very well.
Plan B: A secondary standard is needed that allows for a pipe-delimited formatted ASCII file to be exported from a central registry database to be input into an analytical tool. The two models for this are SEER*STAT and MATCH*PRO.
The primary assumption I am making is that right now a pipe ('|') is not contained in any names, addresses, or coding schemes collected by a registry or imported into a registry software system. If that assumption is violated, then we need to find another delimiter.
I would like to see developed analytical file formats that consist of selected data items needed for normal work. For instance, prior to calls for data, a list of data items would be developed along with the order that could be brought into GenEdits and InterRecordEdits. That file format would be installed in the registry vendor software to output the subsets of necessary cases.
In New Hampshire, because we are so small, we would seek all cases 1995-2017 that meet the required criteria. The output file would contain approximately 136 reportable data items along with a few confidential data items to facilitate editing of cases. The resulting file would be smaller than an XML file, faster to output and faster to read in the analytical software.
I would strongly prefer that the header use NAACCR Item numbers as the variables names (N18_20, N18_390, N18_400, etc.) to make manipulation easier.
Rarely does a central registry need to output the entire NAACCR record. It would be necessary for inter-state data exchange, for archive purposes, and for transmission to some authorities.
Other pipe-delimited file formats could be used for submission to the NAACCR Geocoder, Match*PRO, etc. SAS PROC IMPORT can easily read a pipe-delimited file with a header.
The use of analytical pipe-delimited files does not diminish the value of the XML Standard for Data Exchange. At a certain size, a delimited file becomes unwieldy and cumbersome.
Bruce, thank you for this request and your description of the problems you may face with XML. In the NAACCR XML Workgroup we have been discussing the utility of a delimited file format for certain use cases, specifically related to compatibility in SAS and other statistical software. As these discussions are ongoing, we are all forming and re-forming our opinions on the matter so I am not sure I can present a coherent picture of where the discussions currently land until we have more discussion. (You are welcome to join our Workgroup at any time)
On the subject of choosing a delimiter, we can define escaping rules for whatever delimiter we choose, so I am not worried about trying to guess whether a certain character will show up in the output or not.
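As a minimal illustration of what such escaping rules could look like, the Python sketch below writes a pipe-delimited file with a header using the standard csv module, which quotes any field that contains the delimiter, and then reads it back unchanged. The quoting rule shown is only one possible convention, not a workgroup decision, and the function names and sample row are made up; any consumer of the file would need to apply the same rule.

```python
import csv
import io

# Made-up row whose last name contains the delimiter character.
ROWS = [{"patientIdNumber": "00000001", "nameLast": "O|EXAMPLE"}]
FIELDS = ["patientIdNumber", "nameLast"]


def write_pipe_delimited(rows, fields):
    """Return the rows as pipe-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fields,
        delimiter="|",
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_pipe_delimited(text):
    """Parse pipe-delimited text back into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="|")]


text = write_pipe_delimited(ROWS, FIELDS)
round_trip = read_pipe_delimited(text)
```

The value containing a pipe is wrapped in double quotes on output, and reading the text back returns the original row, so the delimiter no longer has to be absent from the data.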
On the subject of header names, I am not a fan of defining another name for data items. We are currently working with UDS to harmonize the NAACCR "Short Name" list with the current XML naaccrIds so that we can reduce duplicate naming efforts in the NAACCR Community. This will probably involve shortening the XML naaccrIds and agreeing on a standard way to generate stable names across versions.
Isaac,
I was on the last call and I listened to the discussions.
I work with about 150 variables from the NAACCR dataset in SAS. It is much easier to type the NAACCR Item number than some random short name. The lookup is much faster. I am not asking for another name. The NAACCR numbers are in place.
On a related topic, I am trying to work with Windows PowerShell to manipulate HL7 ePath records. PowerShell provides a very useful way to do that with only a few commands. In my work, I discovered that PowerShell can also manipulate XML files. That should be very helpful.