Forum - XML Standard


Need to build an interface for XML to SQL Data


    Posted 05-06-2021 01:57 PM
    Jeff Reed

    We are kicking off a project to design/build/test an XML conversion interface for the NAACCR 'I' files that we are currently converting from fixed length format.

    This will be a solution delivered with a combination of C# class libraries and Oracle/SQL Server stored procedures and potentially third party tools like XML Spy. Currently we store the NAACCR Layout(s) to include fixed length field positions and fields types in the database which we use to dynamically build insert statements. At this point I am not sure if we will need to build a XML to Flat file converter or directly write the XML records into the database.

    Looking for recommendations on how to proceed with designing an XML interface. Understanding the complexity of the XML hierarchy as well as pointing us to any existing code/applications/test files would be a great help.
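    The metadata-driven loading described above — layout rows stored in the database driving dynamically built INSERT statements — can be sketched roughly like this. This is a minimal Python/SQLite sketch for portability; the table and field names are made up for illustration and are not the registry's actual schema.

```python
# Hypothetical sketch of a layout-driven fixed-length loader.
# Layout metadata (field name, start position, length) is stored in a table
# and used to slice records and build INSERT statements dynamically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE layout (field_name TEXT, start_pos INTEGER, length INTEGER)")
conn.executemany("INSERT INTO layout VALUES (?, ?, ?)",
                 [("patientIdNumber", 1, 8), ("dateOfDiagnosis", 9, 8)])
conn.execute("CREATE TABLE incident (patientIdNumber TEXT, dateOfDiagnosis TEXT)")

def load_fixed_record(record: str):
    """Slice a fixed-length record per the stored layout and insert it."""
    layout = conn.execute(
        "SELECT field_name, start_pos, length FROM layout ORDER BY start_pos").fetchall()
    cols, vals = [], []
    for name, start, length in layout:
        cols.append(name)
        vals.append(record[start - 1:start - 1 + length].strip())
    sql = f"INSERT INTO incident ({', '.join(cols)}) VALUES ({', '.join('?' * len(cols))})"
    conn.execute(sql, vals)

load_fixed_record("0120000120210506")
row = conn.execute("SELECT * FROM incident").fetchone()
```

    The same shape would carry over to Oracle or SQL Server stored procedures; only the parameter syntax changes.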

    Rich Pinder

    Thanks, Jeff… this is a long-needed item, and one the community will be grateful to you for working on!! Rich Pinder here, programmer in an earlier life and a long-time NAACCR member.

    No matter which tools/platforms/approach you take, your work will help for sure! And as a resource for the Vendor community, I'm sure many will be eager to learn from your success.

    An open/generic approach which might work on a number of platforms WOULD be optimum. SQLServer and MySQL come to mind as ones more commonly seen these days by folks who might be interested in loading into SQL. Oracle was our platform of choice long ago, before it became too pricey for us.

    One thought would be to optimize a C# tool to unpack the XML (using XMLSpy to get the XML structure?), via a class library which might be upgradable as the next version of our XML definition comes into play – but then perhaps look to an open source DB tool, maybe using the slow (but workable) generic ODBC driver approach, to spit out the insert code – that way it may be usable by more systems. Just a thought.

    A while back I believe Isaac/Kentucky built something to load directly into a relational database (MySQL perhaps?)… reviewing their approach would be good.

    That's it for a quick comment – would love to chat (or e'chat) more on the thread as it unfolds. I'm always reachable at (323) 442-1640.

    Kathleen Beaumont

    Hi Jeff,

    NPCR produced XML Exchange Plus as a working example of an implementation of the NAACCR XML v1.3 specification. Whether you choose to use the XMLPlus.dll or simply use its API as a launch-point for your own implementation, you might find the information in the XMLExchangePlus help file a good place to start.

    The installer (includes XMLPlus.dll, a Windows application demonstrating the library's features, and a comprehensive help file) can be found at the NPCR web site:
    https://www.cdc.gov/cancer/npcr/tools/registryplus/xml-exchange-plus.htm

    Kathleen Beaumont
    retired programmer for EDITS50 and XML Exchange Plus

    Fabian Depry

    Hi Jeff,

    Another resource you might find useful is the Java implementation of the NAACCR XML specifications: https://github.com/imsweb/naaccr-xml

    You probably won't be able to use any of the code (Java doesn't play well with C# and vice versa) but you can certainly see many examples and get some ideas.

    The project home page contains links to the specifications and to a wiki page. I strongly encourage you to take a look at the wiki; it has all kinds of information about NAACCR XML, including a (Java-based) standalone tool that converts XML to flat and flat to XML, and a bunch of valid and invalid XML files that can be used to test your implementation.

    I hope this helps!

    Many thanks to all for the great resource links provided. I have always been a big fan of "borrowing" code where available and will be sure to develop as generically as possible to help share the wealth.

    My next steps will be to review open source tools as Rich suggested, review XMLExchangePlus, thank you Kathleen, and dig into the Java logic, thank you Fabian. Should be able to get some good design and architecture ideas from those sources.

    At this point I am thinking of developing C# class libraries for generic table definitions and integrating the code for converting XML to flat files for "bulk loading". I am thinking the flat files could be "|"-delimited or fixed-length format, with wrappers to load the flat files into the respective databases. This does not rule out using the tools suggested above.

    The second step of post load ETL processing would be more DB specific, Oracle in my case, as the data is validated and normalized for relational DB storage. We should be able to stick to ANSI standard SQL to help with different DB platforms.

    Many thanks to all, will keep you posted, I feel the excitement building!

    Jeff

    Kathleen Beaumont

    converting XML to flat files for "bulk loading"

    This is just a thought, but what if you instead created "workspace" tables in your database that followed the structure of the XML file, i.e., a table for Patient data and another for Tumor data. Each would have primary key columns, and the Tumor tuples would have foreign key relationships to the Patient tuple.

    This would make it easier to perform case-specific and batch-update "fixes" to data on the way into the production tables. (This may not be something you do at your registry, but over the years I've heard lots of programmers say they need to fix misspellings and other such tweaks.) It would also be easy to write a "reviewer" app to look up cases, filter an incoming data file on any criteria, generate reports, whatever your analysts are clamoring for.

    Then you could run a handful of SQL insert/update queries to pull the data from the workspace tables into the production tables, and zap the workspace tables for next time.
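    The workspace-table idea above can be sketched in portable SQL (run here through Python's sqlite3 so the sketch is self-contained; all table and column names are illustrative, and a production version would target Oracle/SQL Server with the same shape):

```python
# Sketch of "workspace" tables mirroring the XML structure (Patient -> Tumor),
# then a promote-and-zap step into production tables.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ws_patient (
    patient_id      INTEGER PRIMARY KEY,
    patientIdNumber TEXT
);
CREATE TABLE ws_tumor (
    tumor_id    INTEGER PRIMARY KEY,
    patient_id  INTEGER REFERENCES ws_patient(patient_id),
    primarySite TEXT
);
CREATE TABLE prod_tumor (patientIdNumber TEXT, primarySite TEXT);

-- stage one patient with two tumors
INSERT INTO ws_patient VALUES (1, '01200001');
INSERT INTO ws_tumor VALUES (1, 1, 'C509'), (2, 1, 'C619');

-- case-specific or batch "fixes" would run here, against the workspace only;
-- then promote into production and zap the workspace for next time
INSERT INTO prod_tumor
    SELECT p.patientIdNumber, t.primarySite
    FROM ws_tumor t JOIN ws_patient p ON p.patient_id = t.patient_id;
DELETE FROM ws_tumor;
DELETE FROM ws_patient;
""")
rows = conn.execute("SELECT * FROM prod_tumor ORDER BY primarySite").fetchall()
```

    A "reviewer" app would simply query the workspace tables before the promote step runs.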

    I just sort of hate to see you convert XML to flat ASCII files.

    I'll butt out now.

    Kathleen

    Jeff Reed

    Good feedback, thank you Kathleen. Part of my initial design consideration was to minimize re-coding of existing systems that dynamically load flat files based on NAACCR version, but it would make sense to load the XML directly and transform from there. The last time I had to load XML data we had to revert to flat or delimited files due to performance issues, which may not be as big a problem with current SW/HW. And no butting out, you're in too deep now…

    Rich Pinder

    …butt out… oh no! I'd suggest a 'retired programmer' is just what's needed here !!! (how do I insert a grinning emoji here !)

    This approach that you're all talking about (I think) parallels what some of us talked about long back in the WG history, where it became apparent that the sexiest way to handle schmoozin' the XML-structured data into a relational model would be via an intermediate step where you essentially build relations to emulate whatever structure our NAACCR XML flavor goes to (we have record/pt/tumor now… but down the line… LOTS more comin', me thinks). I think this philosophy of processing our stuff is along the same lines as what you did, Fabian, for getting XML into SAS (datasets).

    Using the native SQL flavor that a vendor has installed makes a lot of sense. Extra/temp tables also makes sense, where review and manipulation of the data can be done before loading into the production tables.

    Using an open source, file-based solution (i.e. SQLite) would also be a way to offer an alternative to unpack the XML into an ANSI-compliant SQL data structure, without using a full-fledged RDBMS.

    Great THREAD'n here folks… Fabian, mebe this little bunch should petition NAACCR to spin up another of those InPerson 'workfest' trips, like we did to initially grok out the XML structure! where did we do that …. was that Atlanta ?? (bad memory chips in this 'probably should be retired' programmer) [insert another emoji here, with fingers in ears, wavin hands!]

    adios for now …

    Kathleen Beaumont

    Hi Jeff,

    And no butting out, you're in too deep now …

    You may change your mind after this… 

    I admit that, to a fault, I prefer to write my own utilities, especially when existing tools annoy me. So if I were doing this, I would

    • read in a Patient record from the XML data file
    • iterate through a list of all possible Patient-level data items *
    • dynamically build an INSERT statement for just the items provided in the XML record
    • and execute it

    * You know what items matter to you, so if the XML contains additional/unexpected data items (i.e., ItemDefs) you are not interested in them anyway.

    Then do the same for all of the Tumor data associated with this Patient. I don't think it would take long to write a quick-and-dirty test app to see how the performance compares to using the XML importer built into Oracle. If performance is acceptable, you could make the tool easily configurable to modify the "list of items we care about", so that future changes to the NAACCR dictionary can be accommodated.
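    The item-filtering loop above might look roughly like this (a Python sketch with a made-up XML fragment and item list; real NAACCR XML also carries a namespace and far more items, omitted here for brevity):

```python
# Sketch: read a Patient element and build an INSERT for only the items that
# are both present in the XML and on our configurable "items we care about" list.
import sqlite3
import xml.etree.ElementTree as ET

XML = """<Patient>
  <Item naaccrId="patientIdNumber">01200001</Item>
  <Item naaccrId="race1">01</Item>
  <Item naaccrId="someUnexpectedItem">x</Item>
</Patient>"""

WANTED = {"patientIdNumber", "race1", "dateOfBirth"}  # configurable list

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (patientIdNumber TEXT, race1 TEXT, dateOfBirth TEXT)")

patient = ET.fromstring(XML)
# keep only items we track; unexpected ItemDefs are simply ignored
items = {i.get("naaccrId"): i.text for i in patient.findall("Item")
         if i.get("naaccrId") in WANTED}
cols = list(items)
sql = f"INSERT INTO patient ({', '.join(cols)}) VALUES ({', '.join('?' * len(cols))})"
conn.execute(sql, [items[c] for c in cols])
row = conn.execute("SELECT patientIdNumber, race1, dateOfBirth FROM patient").fetchone()
```

    Items absent from the XML (here dateOfBirth) simply land as NULL, and future dictionary changes only touch the configured list.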

    Now remember, if you respond to me you're just encouraging me to keep making suggestions!

    Good luck, and I hope you'll continue to post about what you're planning. The subject is very interesting to me.

    Jeff Reed

    Thank you Kathleen,

    What a nice dissection you gave in navigating a NAACCR XML record from a patient-record point of view. It makes a lot of sense to adopt XML translations to break down the records as part of this loading process. I could see breaking out the different record types into separate XML load modules to support relational data models; that would be useful. We only get the 'Incident' record type, so our focus may be a little different, as it centers on tumors with the classic struggle to uniquely identify a patient to weed out duplicates. (Pls use a separate thread for the duplicate topic.)

    My background is heavy on database architecture tied to ETL performance, so I do have a classic "I can do it faster" bias that I am working on. That strength/weakness makes me still lean towards producing files for bulk inserting rather than individual insert statements, as we regularly need to process over 100 thousand files a day. As we already have a lot of business rules and validation tables in use in the database, we are looking to leverage those in the end solution. That said, I will continue to try to keep a focus on a universal solution that will support a broader audience.

    If nothing else, my flat-earth file mentality may keep "encouraging" you to add light to this sea of data…

    Kathleen Beaumont

    Jeff,

    Going back to basics: You are presently bulk-importing from flat-format files, and said you hoped to minimize re-coding existing systems, so let's loop back to that concept.

    Keep in mind that a big motivation for using XML is to allow transmitting complete text data (e.g., physician's notes, etc.) instead of truncating it to a couple of thousand characters, so you don't want to use the current NAACCR undelimited format. (These data items are identifiable in the NAACCR XML dictionary with allowUnlimitedText="true".) Also note that the plan is to eliminate the "starting position" attribute of NAACCR data items in the not-distant future, so positioning in a flat file will be something you'll have to maintain with your own "record layout". Boy, this is getting ugly really fast, isn't it?

    When importing from flat ASCII, do delimit the fields so that you can know where the text fields begin and end. But I recommend not delimiting with the pipe character ('|'). In fact, don't use any keyboard character because somebody is bound to embed it in a text field. I like the guillemet for this purpose ('»', typed from the keyboard with Alt+0187) because it is really unlikely to be typed during data entry. And you'll need to flatten the CRLFs in text data; I think SQL Server understands "~" to be a linefeed, but you'd have to look into what Oracle uses.
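    The delimiting and CRLF-flattening steps above can be sketched as follows. The '~' newline token mirrors the suggestion in the post and is an assumption to be verified against whatever the target database's bulk loader actually expects:

```python
# Sketch: join fields with the guillemet delimiter and flatten embedded
# line breaks so multi-line text survives a one-record-per-line file.
DELIM = "\u00bb"   # '»' (Alt+0187), unlikely to appear in data entry

def flatten(fields, newline_token="~"):
    """Replace embedded CRLF/LF in each field, then join with the delimiter."""
    clean = [f.replace("\r\n", newline_token).replace("\n", newline_token)
             for f in fields]
    return DELIM.join(clean)

line = flatten(["01200001", "physician note line 1\r\nline 2"])
# a loader later splits on the delimiter and restores the newline token
parts = line.split(DELIM)
```

    The key property is that split/join round-trips cleanly because neither the delimiter nor the newline token can occur inside a field.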

    You'll have to write the converter yourself; the freebie IMS and NPCR tools perform conversions between XML and the traditional undelimited NAACCR flat record.

    BTW, when you import data now are you loading into temporary tables, or are you loading straight into your production tables? And how long does it take you to load 100 thousand data files of Incident records each day? Again, I'm asking simply because I am finding this whole topic really interesting… so if you get bored with amusing me, feel free to ignore me!

    Jeff Reed

    Status Update:

    My first task was to test Oracle's capabilities to store and process XML. I used their bulk Data Loader program to load the sample NAACCR version 140 XML file into a CLOB (Character Large Object) field that could then be parsed with Oracle's XML functions. The field size for a CLOB can be up to 128 terabytes. It didn't take me long to say forget the Oracle XML code, given the CDC DLL's capabilities to parse a NAACCR XML file.

    Next task was to create a XAML application (screen pic in attachment) with a C# back-end for parsing an XML file using the CDC's XMLPlus.DLL. I got the application to read the XML file and assemble patient/tumor records in a record/field-delimited file suitable for bulk loading into our DB. I am adopting what I am calling the KB delimiter in honor of Kathleen B.: &#187; (»).

    Next Queued tasks
    • Integration of the CDC's EDIT50.dll for scoring and reporting.
    • Identify/create better testing data. The current XML sample file is a NAACCR version 140 'A' record type; I need to test versions 160 and 180 for record type 'I'.
    • Continue refining/testing load to add error handling and reporting.

    Kathleen Beaumont

    Hi Jeff,

    I'm glad you were able to use the CDC's XMLPlus.dll to parse a sample data file.

    Regarding the second bullet in your "Next Queued tasks" list: An easy way to create a NAACCR XML data file uses the GUI app in the CDC download, XMLExchange Plus. Open the Convert form to select a flat-format data file and an EDITS50-compatible NAACCR metafile (or, if your registry has a custom metafile, use that). Press F1 to bring up the Help pages for this process, and generate the data file(s) you need for your continuing development.

    The point I'm trying to make is that you can use a flat data file of Record Type '160' and generate a NAACCR XML data file that meets today's requirements. Be sure to read the help for drop-down control "Record Type", which allows you to specify the Record Type of the resulting data file.

    Kathleen Beaumont

    Regarding this "Next Queued task":

    * Integration of the CDC's EDIT50.dll for scoring and reporting.

    If you are just starting with the EDITS50 API, you might want to invest a half hour watching the "Tour of the EDITS50 API documentation" WebEx (see https://www.naaccr.org/forums/topic/edits50-tools-implementation-forum/ for the link).

    Kathleen

    Jeff Reed

    Are there any sample large record-set files for V18, fixed and XML, out there? Also looking for recommendations/links for tools to help generate sample files (SEER?).

    Fabian Depry

    Hi Jeff, You can use the SEER Data Generator to create large NAACCR XML files. Only core data items will have a value (about 40 or 50 of them), so it's not perfect (for example, it would be nice to have text for abstracts). But it's better than nothing 

    You can download it from this release page: https://github.com/imsweb/data-generator/releases

    Look for the "data-generator-X.X-all.jar" where "X.X" is the latest release.

    This is a standalone JAR, so you should be able to just download it and double-click it (assuming you have a Java environment installed on your machine).

    You can find information about the data generator itself and what variables it computes on the project's home page: https://github.com/imsweb/data-generator

    Jeff Reed

    Brought a tear to my eye to be able to generate 5,000-record sample files for V16 and V18, fixed and XML, including my test facility_id and home state of IL. Now for some real testing…

    Fabian Depry

    As long as those are tears of joy! 

    Good luck with your testing.

    Rich Pinder

    Hi all.
    Catching up a bit on last few posts, Jeff you've been busy!

    Having realistic and cross'variable logical data to process in useful testing is important for sure, and something another NAACCR WG has been working on for a while. Fabian/Isaac used an initial dataset from this group in the June HackAThon – work does continue in that group. I'll check to see the status and get back to all.

    You mention that your next steps are looking to test the I record? Looking forward to DB loader tool needs for XML, I think some will want the ability to include all record types. Will the tools be able to handle A records??

    Also… in your C# program, are you unpacking the XML to a SINGLE file (usin' the KB delim!)… or into SEPARATE files, based on the relations? I was envisioning this temporary/intermediary approach (xml -> flat -> sqlinsert) might grow more complex, as a future XML spec from NAACCR might (will?) move past consolidated data and into fully nested processing data structures (pat->tum->admission, pat->genetic tests, etc.). If that move does happen, would a single file still allow for efficient bulk loads??

    aok… this is really encouraging to see, and a great help to the community!
    take care


    Jeff Reed

    Thank you Rich,

    The data generation tool Fabian pointed me to on GitHub looks like it did a good job of generating test files (haven't used the files I generated yet). The tool populates 40-50 field variables, so it is a good source to test basic load functionality, but shy of what's needed for timing a fully populated record load test.

    Though my needs are only for processing 'I' records, the tool will be able to handle all record types. I am currently working on incorporating a user customized XML reference file to define the fields to extract into the load file.

    I am handling the data as one flat file, with each row representing a case/tumor. Patient/header data will be repeated on each row. Much of the validation and breakdown of the data will be done in the DB. I still see the database being needed to check for duplicates and generate aggregate totals.
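    The one-row-per-tumor flattening described here might be sketched like this (Python, with an abbreviated, namespace-free XML fragment for illustration; real NAACCR XML carries a namespace and many more items):

```python
# Sketch: emit one delimited row per Tumor, repeating patient-level items
# on every row, as described in the post.
import xml.etree.ElementTree as ET

XML = """<Patient>
  <Item naaccrId="patientIdNumber">01200001</Item>
  <Tumor><Item naaccrId="primarySite">C509</Item></Tumor>
  <Tumor><Item naaccrId="primarySite">C619</Item></Tumor>
</Patient>"""

DELIM = "\u00bb"   # the "KB delimiter"

def items_of(elem):
    """Map naaccrId -> text for the direct Item children of an element."""
    return {i.get("naaccrId"): (i.text or "") for i in elem.findall("Item")}

patient = ET.fromstring(XML)
pat_items = items_of(patient)
rows = []
for tumor in patient.findall("Tumor"):
    t_items = items_of(tumor)
    rows.append(DELIM.join([pat_items.get("patientIdNumber", ""),
                            t_items.get("primarySite", "")]))
```

    Each row is then ready for the bulk loader; duplicate detection and aggregation happen later, in the database.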

    Building multiple files based on a relational data model is not envisioned at this point, but I could see where that would be useful if there were a global data model of the data stores to match the XML. Generic SQL to generate schemas could be a component of that effort. Relationally, we consider a unique case/tumor key on facility_id, accession_nbr and sequence_nbr. Identification of the patient is a different story, for which I am sure there is a healthy thread out there somewhere…

    A correlated side project will be to use the EDIT50.DLL from the CDC to apply edit checks/scoring for each tumor record. This will be incorporated into the NAACCR XML file parser I am building (looking for a name for this beast). This process would also create KB delimited file(s) for use in bulk loading.

    Thank you for all your feedback

    Jeff Reed

    Just noticed the test files from the GitHub sample file generator did not include the naaccrNum attribute on the <Item> tag, whereas the sample file on the NAACCR site did. Is this naaccrNum attribute going to be required, or do I have to use the full name in the naaccrId attribute to map fields?

    <Item naaccrId="patientIdNumber" naaccrNum="20">01200001</Item>
    vs
    <Item naaccrId="patientIdNumber">00000001</Item>

    Fabian Depry

    The standard itself doesn't require the "naaccrNum" attribute because technically the "naaccrId" is all that is needed to uniquely identify an item. But a lot of software still deals with the numbers, so it's convenient to have them.

    I will update the data generator to allow an option to add the numbers to the created file (https://github.com/imsweb/data-generator/issues/30).

    Ultimately, it's up to you if you want your new framework to require the numbers on incoming data (which is probably more convenient for you), or if you want to follow the strict standard and deal with not always having those numbers.

    Jeff Reed

    Thank you Fabian, I guessed wrong and will switch to the required field <Item naaccrId=>. I was hoping naaccrNum would be required so I didn't have to worry about the full-name match.

    Is the best source to use for mapping the version_id to the valid xml name the "naaccr-dictionary-180.xml" file?

    Fabian Depry

    Yup, the "naaccr-dictionary-180.xml" is the best source for mapping NAACCR numbers to NAACCR IDs, at least for standard items. For non-standard items, that mapping should be provided in a "user-defined dictionary" by whoever created the XML data files (the data-generator doesn't support user-defined dictionaries; I wanted to keep things very simple for now).
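    Building that number-to-ID map from a dictionary file shaped like naaccr-dictionary-180.xml might look like this. The fragment below is abbreviated; the namespace URI matches the NAACCR XML specification, but verify it against the actual file from the imsweb/naaccr-xml repository:

```python
# Sketch: parse ItemDef entries from a NAACCR dictionary file and build
# a naaccrNum -> naaccrId lookup table.
import xml.etree.ElementTree as ET

DICT_XML = """<NaaccrDictionary xmlns="http://naaccr.org/naaccrxml">
  <ItemDefs>
    <ItemDef naaccrId="patientIdNumber" naaccrNum="20" length="8"/>
    <ItemDef naaccrId="dateOfDiagnosis" naaccrNum="390" length="8"/>
  </ItemDefs>
</NaaccrDictionary>"""

NS = {"n": "http://naaccr.org/naaccrxml"}  # assumed namespace; check the real file
root = ET.fromstring(DICT_XML)
num_to_id = {int(d.get("naaccrNum")): d.get("naaccrId")
             for d in root.findall(".//n:ItemDef", NS)}
```

    Entries from a user-defined dictionary, where one exists, would be merged into the same map.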

    Jeff Reed

    Can't seem to get the XMLPlus DLL function XMLPlus_GetItemDataByNaaccrId(const int XmlId, const char* naaccrId) working.

    I was able to get the function XMLPlus_GetItemDataByNaaccrNum(const int XmlId, const int naaccrNum) working.

    The …ByNaaccrId function doesn't seem to like the pointer I try to set; the function I got working does not use a pointer and just passes the numeric ID value. Is it reasonable to require the NAACCR number in the data? That would be my preferred solution rather than using the shortnameID.

    Kathleen Beaumont

    Hi Jeff,

    Can't seem to get the XMLPlus DLL function XMLPlus_GetItemDataByNaaccrId(const int XmlId, const char* naaccrId) working

    First, what are the return values of your calls into these functions? If any is anything except the all-is-well 0 (zero), you should use the functions in the chapter Handling System Errors to get better information.

    Let's start with that, and if needed we can dig into your code. Please note, however, that my language is C++. If we end up in some technical area of C#, we're going to need to enlist the help of somebody with expertise in that area.

    Is it reasonable to require the NAACCR Number in the data?

    FWIW, XMLPlus.dll always writes the data including both the naaccrId and the naaccrNum, because it makes editing the data more efficient. As to being able to require the naaccrNum, that probably depends upon how much clout you have with your reporting entities. Maybe you can make it a rule for your region?

    Kathleen Beaumont

    Perhaps the byNaaccrId function will not work in C#?

    Oh, it will work all right. I just have to enlist the help of a C# programmer at the CDC to look at it. Dollars to donuts it will be something about the declarations.

    The help file says the C# declaration for this function should look something like this:

    [DllImportAttribute("XMLPlus.dll", EntryPoint = "XMLPlus_GetItemDataByNaaccrId")]
    public static extern int XMLPlus_GetItemDataByNaaccrId(int XmlId,
       [InAttribute()] [MarshalAsAttribute(UnmanagedType.LPStr)] string naaccrId);

    (I just typed that from the help file, so please look to it there.) My hunch is that your declaration is not properly converting a C# string to a pointer to array of char, which of course is what you need for a C interface across languages. In the meantime, I'll see if I can get somebody else to write a test for these functions at the CDC.

    Kathleen Beaumont

    Jeff,

    Would you mind running a test for me? You said you were able to get your code to retrieve data values using the "byNaaccrNum" function. Could you run the loop so that you first call "byNaaccrNum" and then call "byNaaccrId"? I'm trying to narrow the problem down to support my hunch about the declarations, and need to know that all of the other functions are working as advertised.

    So the pseudo-code would look like this (note that all of these calls should capture and test the return value… this is just pseudo-code for program flow):

    XMLPlus_Initialize(...)
    XMLPlus_OpenXmlDataFile(...)
    
    /* while not end-of-file */
    {
        XMLPlus_ReadNextPatient(...)
    
        XMLPlus_GetPatientTumorsCount(...)
    
        for (<iterate over tumors for current patient>)
        {
            XMLPlus_ReadTumor(...)
    
            for (<iterate list of data items you want>)
            {
                 XMLPlus_GetItemDataByNaaccrNum(...)
                 XMLPlus_GetItemDataByNaaccrId(...)
            }
        }
    }
    
    XMLPlus_CloseXmlDataFile(...)
    XMLPlus_Exit(...)

    You might even hard-code some items for this test, e.g., naaccrNum=390 and naaccrId="dateOfDiagnosis". Let me know what happens.

    Jeff Reed

    Thank you Kathleen,

    Your pseudo-code is what I have in development, so no problem testing. Looks like I have a problem using a pointer return variable to access the actual string. The ByNumber() function I successfully used returns the value, so I have no problem there; the ByNaaccrId() returns a pointer, and apparently I am rusty on using pointers to access the string value, because I can't seem to get it working and am not finding a good example.

    [DllImportAttribute("XMLPlus.dll", EntryPoint = "XMLPlus_GetItemDefByNumber")]
    public static extern int XMLPlus_GetItemDefByNumber(int XmlId,
    int naaccrNum, System.IntPtr owner, System.IntPtr callback_func);

    [DllImportAttribute("XMLPlus.dll", EntryPoint = "XMLPlus_GetItemDefByNaaccrId")]
    public static extern int XMLPlus_GetItemDefByNaaccrId(int XmlId,
    [InAttribute()] [MarshalAsAttribute(UnmanagedType.LPStr)] string naaccrId,
    System.IntPtr owner, System.IntPtr callback_func);

    Kathleen Beaumont

    Jeff, I'm getting confused.

    I thought you were working with the functions to read the data file, but the declarations you just quoted are for reading the dictionary.

    When reading the data file, you pass the callback parameters (owner, callback_func) to the XMLPlus_OpenXmlDataFile function. My thinking was that you'll be calling the "XMLPlus_GetItemDataBy[NaaccrId or NaaccrNum]" function gazillions of times, so just post the callback once for the entire run. (There is some set-up involved with posting a callback in the DLL, not costly, but why waste cycles? Reasonable people can argue about my design choices…)

    Looks like I have a problem using a pointer return variable to access the actual string.

    The "GetItemDataBy" functions don't retrieve the data value directly; instead, the DLL locates the value by the identifier, and then sends the "answer" back through the callback function. I have a very good reason for this design choice: If you expected the DLL to populate a reference variable supplied in your direct call, you would have to first ask how long that string value is so that you could size your string appropriately (otherwise, danger of buffer overflow!). But through the magic of pointers, the DLL is maintaining the contents of the requested item in its local memory long enough to stream it to your callback function, where you can capture it to a C# string and manage that memory yourself.

    Take another look at the C++ sample, "Run EDITS on NAACCR XML Data File", and look for XmlToFlatReadItemByNaaccrNum and XmlToFlatReadItemByNaaccrId. Particularly for preparing the EDITS buffer dynamically, you can see that more steps are required when you have to do the look-up by naaccrId, but it can be done.
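    The callback design described here — the library streams the string to your callback instead of filling a caller-sized buffer — is a standard C-interop pattern. As a language-neutral illustration (this is not the XMLPlus API), the same pattern can be driven from Python with ctypes, with libc's qsort standing in for the DLL and the comparator standing in for the read-item callback; assumes a Unix-like system:

```python
# Sketch: a C library pushes pointers to our callback; we copy the values
# into memory we own, exactly as the XMLPlus callback design intends.
import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.qsort.restype = None

seen = []  # values captured from the C side, now owned by Python

CMP = ctypes.CFUNCTYPE(ctypes.c_int,
                       ctypes.POINTER(ctypes.c_char_p),
                       ctypes.POINTER(ctypes.c_char_p))

def compare(a, b):
    # the library hands us pointers into its memory; copy the strings out
    sa, sb = a.contents.value, b.contents.value
    seen.append(sa)
    return (sa > sb) - (sa < sb)

arr = (ctypes.c_char_p * 3)(b"tumor", b"patient", b"item")
libc.qsort(arr, 3, ctypes.sizeof(ctypes.c_char_p), CMP(compare))
ordered = list(arr)
```

    The point of the pattern: the caller never has to pre-size a buffer, because the string only exists in the library's memory long enough for the callback to copy it.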

    Jeff Reed

    Sorry, I copied the wrong ones. I am using these:
    [DllImportAttribute("XMLPlus.dll", EntryPoint = "XMLPlus_GetItemDataByNaaccrNum")]
    public static extern int XMLPlus_GetItemDataByNaaccrNum(int XmlId,
    int naaccrNum);

    [DllImportAttribute("XMLPlus.dll", EntryPoint = "XMLPlus_GetItemDataByNaaccrId")]
    public static extern int XMLPlus_GetItemDataByNaaccrId(int XmlId,
    [InAttribute()] [MarshalAsAttribute(UnmanagedType.LPStr)] string naaccrId);

    When I was using the ByNaaccrNum I could reference the string in the callback:
    iRtn = XMLPlus_Interop.XMLPlus_GetItemDataByNaaccrNum(xmlId, t_num);
    sPPatient = sPPatient + readitem_callback.ToString() + "²";

    The ByNaaccrId call is different in that I need to use the pointer that is returned. The process of building a char array and a StringBuilder routine to build the value is what I was hoping there were some examples of, since I am having trouble getting the pointer to the value working.

    Kathleen Beaumont

    The callback parameters are the same regardless of whether you are processing by naaccrNum or naaccrId. Maybe I'm wrong here, but when I look at this function in my help file, isn't parameter "value" a C# string? Or are you having to do something to it when you use this code by naaccrNum?

    private static void ReadItemByEitherNaaccrIdOrNaaccrNum(
        System.IntPtr owner,
        int patient_ordinal,
        int tumor_ordinal,
        [InAttribute()][MarshalAsAttribute(UnmanagedType.LPStr)] string NaaccrId,
        int NaaccrNum,
        [InAttribute()][MarshalAsAttribute(UnmanagedType.LPStr)] string value)
    {
        // etc.
    }

    The ByNaaccrId call is different in that I need to use the pointer that is returned. The process of building a char array and a StringBuilder routine to build the value is what I was hoping there were some examples of, since I am having trouble getting the pointer to the value working.

    I interpret the above syntax to be an instruction to C# that the parameter will arrive as a pointer to array of char, which C# converts on the spot to a C# string. You shouldn't have to do anything else special to be able to use these strings in your code.

    May I see the code for the callback function in your code that works when you are retrieving data by value?

    Kathleen, still hoping a C#-to-C maven will join this thread!

    Jeff Reed

    This C# code I use works:
    iRtn = XMLPlus_Interop.XMLPlus_GetItemDataByNaaccrNum(xmlId, t_num);
    sPPatient = sPPatient + readitem_callback.ToString() + "²";

    This C# code returns the pointer value, which I need to use to build the string value:
    iRtn = XMLPlus_Interop.XMLPlus_GetItemDataByNaaccrId(xmlId, t_id);
    sPPatient = sPPatient + readitem_callback.ToString();
    // need to fill string value from returned pointer; function requires void def., not char, for pointer

    I'll keep looking. The code works for num; I may have to go back to that and ask Fabian if he could add naaccrNum to the test file.

    Kathleen Beaumont

    In your code, where and how is readitem_callback declared?

    Is its value assigned in the callback function you passed to XMLPlus_OpenXmlDataFile? That is the function I want to see. From the context of what you wrote above, I'm guessing readitem_callback is a global variable. Its scope will be important to this discussion.

    I don't mean to be haranguing you, it's just that resolving this issue in this public forum could benefit others. I don't want to leave the questions unanswered or incomplete.

    Thanks, Jeff.

    Jeff Reed

    Kathleen thank you for your patience,

    Took a break to lure a largemouth bass, but I can report I got the pointer working for the XMLPlus_GetItemDataByNaaccrId function. Somewhere in the code addition to support both the naaccrId and naaccrNum return values I over-engineered the pointer assignment; nothing like a full re-write to flush out the bugs. The EDITS50.dll worked like a charm when fed a proper pointer.

    Steps to success:

    Added the function pointers:
    XMLPlus_Callbacks.Callback_ReadXmlData funcPtr1 =
        new XMLPlus_Callbacks.Callback_ReadXmlData(ReadItemByNaaccrNum);

    XMLPlus_Callbacks.Callback_ReadProgress funcPtr2 =
        new XMLPlus_Callbacks.Callback_ReadProgress(ProgressFunc);

    XMLPlus_Callbacks.Callback_ReadXmlData funcPtr3 =
        new XMLPlus_Callbacks.Callback_ReadXmlData(ReadItemByNaaccrId);

    Used a checkbox to set the pointer:
    if (cb_m1_byname.IsChecked == true)
        { readitem_callback = Marshal.GetFunctionPointerForDelegate(funcPtr3); }
    else
        { readitem_callback = Marshal.GetFunctionPointerForDelegate(funcPtr1); }

    Called the function with the same switch:
    if (cb_m1_byname.IsChecked == true)
        { iRtn = XMLPlus_Interop.XMLPlus_GetItemDataByNaaccrId(xmlId, t_id); }
    else
        { iRtn = XMLPlus_Interop.XMLPlus_GetItemDataByNaaccrNum(xmlId, t_num); }

    I did end up using a global variable to pass the return value, but will change that so the callbacks call a generic function to build the output.

    private static void ReadItemByNaaccrId(
        System.IntPtr owner,
        int patient_ordinal,
        int tumor_ordinal,
        [InAttribute()][MarshalAsAttribute(UnmanagedType.LPStr)] string NaaccrId,
        int NaaccrNum,
        [InAttribute()][MarshalAsAttribute(UnmanagedType.LPStr)] string value)
    {
        globalVariable.sValue = value;
    }

    The test used a file generated by the GitHub test file generator, with 5,000 tumor records. The program loops through each patient with 7 field-value calls to the get-value function, and through each tumor with 127 calls, to build a delimited flat file of tumors with redundant patient info. It took ~11 minutes to process the file with debugging on.

    Jeff Reed

    Sorry to report I probably will not be able to attend tomorrow's meeting.

    Attached are some screenshots of the current progress using the XMLPlus.DLL: 100 records in 0.4 milliseconds, 250 records in 2.9 seconds, 500 records in ~12 seconds.

    There is a problem with memory not being cleaned up, as the application slows down significantly over 500 records. The memory usage builds up until it takes over 14 minutes for 5,000 records, or it just fails with memory unavailable. (The box has 16 GB of memory, 8 GB free.) Open to ideas from our seasoned XMLPlus users on where to look for clearing memory…

    Kathleen Beaumont

    There is a problem with memory not being cleaned out

    You need to run a profiler to pin down exactly where your trouble is, but just guessing, it seems you are allocating memory for repetitive processing and counting on the C# garbage collector to take care of things. Even as a C++ programmer I learned I have to "think in C" when writing batch processing software. (I love the Standard Template Library, but it's not the right answer when nanoseconds count.)

    So look for ways to re-use allocated objects. For instance, in the C++ code sample for running edits in the XMLPlus API, I build a list of allocated objects describing each of the data items I'll be processing, and I build it just once, outside the processing loop. Most of the properties of those objects are unchanging (name, length, ParentXMLElement, etc.), but for running edits the value property needs to change for each case. So for each data record, I simply update the data value. Try running edits on large data files (I had a test file of more than 83,000 cases… and boy, did that make a huge XML file! … performance did not degrade from the first case to the last).
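    Applied to the C# side of this thread, the same idea might look like hoisting the output buffer out of the per-record loop and reusing it, instead of concatenating strings (each string "+" allocates a fresh copy for the garbage collector to chase). A sketch only — the class and method names here are illustrative, not from Jeff's code:

    ```csharp
    using System.Text;

    static class RecordBuilder
    {
        // One buffer, allocated once and reused for every record.
        static readonly StringBuilder Line = new StringBuilder(4096);

        public static string BuildRecord(string[] fields, char delimiter)
        {
            Line.Clear();                      // reuse the buffer; no new allocation
            foreach (string f in fields)
            {
                Line.Append(f).Append(delimiter);
            }
            return Line.ToString();
        }
    }
    ```

    The capacity hint (4096) keeps the builder from growing repeatedly; the only per-record allocation left is the final ToString().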

    Anyway, I recommend using a good profiler. I always relied upon AQTime from SmartBear. I used it for C++ and Object Pascal projects, but it also supports the .NET languages.

    Jeff Reed

    Don't know what I would do without you, KB. You zeroed in on the problem. The C# garbage collection (PC version: Recycled Resources) did clean out memory, but not the large file content I was building up in memory. I changed the program to write to the file after every tumor instead, and the 5,000-case test file built in 3.3 seconds, down from 14 minutes.
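    For anyone following along, the shape of that fix is worth sketching: stream each record to disk as it is produced, so the working set stays flat no matter how many tumors the file holds. The names below are illustrative, not Jeff's actual code:

    ```csharp
    using System.Collections.Generic;
    using System.IO;

    static class TumorFileWriter
    {
        // Write each delimited tumor record as it is built, instead of
        // accumulating the whole output file in one ever-growing string.
        public static void WriteRecords(string path, IEnumerable<string> records)
        {
            using (var writer = new StreamWriter(path))
            {
                foreach (string record in records)
                {
                    writer.WriteLine(record);   // buffered write; memory stays flat
                }
            }
        }
    }
    ```

    StreamWriter buffers internally, so per-record writes are still cheap; the using block guarantees the file is flushed and closed even if processing throws partway through.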

    Next step: integrate with a .bat file load integrator (Oracle and SQL Server).

    Attachment(s)

    docx
    NAACCR-XML-to-SQL-Load.docx   168 KB 1 version
    docx
    XML-WorkGroup-180718.docx   84 KB 1 version