Re: NMR data as Supplementary Material ?

Rudi Nunlist (rnunlist@bloch.cchem.berkeley.edu)
Sat, 08 Apr 1995 16:55:19 -0700

Several weeks ago I asked for comments on the (im)possibilities of
providing NMR data (preferably the FIDs) for central archival purposes, to be
accessible on the Internet as electronic Supplementary Material.
While I had only asked about the technical aspects, several responses took
issue with the concept as such.

Belated thanks to all who took the time to respond!

Rudi Nunlist

------------------------------------------------------------------

Wellllll, that's a sticky one. While it seems nice to have raw data
available, what would be the reason to have access to it? Are we then
allowed to reanalyze the results? Could we infer information that the
original authors overlooked and publish another paper? I'm not sure
what purpose this type of information serves or who it would benefit.

From the practical aspect this could be the final nudge needed to get
the manufacturers into some sort of standard format. I mean, if the
interest was truly there they would be forced to comply with at least
a "journal" format for their files. Of course it would be difficult
for researchers with older, unsupported machines to comply with this
requirement/request, but surely current operating systems could
accommodate it. Thinking about it, there are only a limited number of
variables needed to define an NMR dataset: block size, sweep width,
offset (a nicety, not necessary), operating frequency, field
strength, and word length. In two dimensions it adds two more
constants. The rest of the vendor stuff is necessary for operation
or reference but not needed to process the data. Once the basic
processing parameters are known, it becomes a matter of
mathematics and computer science. If the journals are interested
they could decide on a "standard" processing software. In practice
the files could be very simple. Early versions of FTNMR and FELIX
only required that you pass about 12 parameters and away it went.
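
For concreteness, here is a minimal sketch of that parameter set as a
Python record. The field names are my own invention, not any vendor's
actual header layout.

    # Minimal "journal format" parameter set, as described above.
    # All names are hypothetical, not taken from any vendor.
    from dataclasses import dataclass

    @dataclass
    class NMRHeader1D:
        block_size: int        # number of complex points in the FID
        sweep_width: float     # spectral width, Hz
        offset: float          # transmitter offset, Hz (a nicety)
        frequency: float       # operating frequency, MHz
        field_strength: float  # magnet field, tesla
        word_length: int       # bytes per stored data word

    @dataclass
    class NMRHeader2D(NMRHeader1D):
        # the two extra constants for the indirect dimension
        block_size_f1: int
        sweep_width_f1: float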

I don't know, but at first thought it would seem that most of the
work would/could be done by the vendors/publishers.

Then again I could be blowing it out my ear again.
------------------------------------------------------------------

A colleague in the NMR lab here forwarded me this note.

Yes, I have done some of this. At NIEHS, much of my work is
in the area of providing Internet (WWW) access to the scientific
databases I create in Oracle. For example, we have an EPR database at
http://157.98.12.104/LMB/home.html
and an upcoming DNA clone database which is not publicly available yet.
If you are interested in finding a contractor to work on this project,
I would be very interested. If you just want some information
on how to get started, I am willing to help out all I can.
------------------------------------------------------------------

I have given this lots of thought because I have several friends who teach at
smaller, undergraduate institutions that don't have NMR instruments. These
people often use "canned" data and third-party software to teach NMR in their
courses. Such a database would be REALLY useful.

>I have thought about some of the (im)possibilities, and so far, have not come
>up with any terrific solution. The main issues (I think) for 1D data are:
>1) Data format (currently more than one per vendor..),

I would say that the "new" Felix format is the closest thing that we have to a
"standard"; however, that kind of gives an advantage to a single software
vendor.

>2) Processing software cost and availability (Several flavors of Unix, Mac,
>PC etc).
This might be the answer. Woody's "Nuts" has the very nice feature of
auto-sensing the format of the data file in question, and importing it without
user intervention. This would be a real plus for such a project because the
"data depositors" could simply provide the data in whatever format is most
convenient for them. Nuts-1d is only $400 for the first copy (educational),
and the price goes down for more copies.
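
Auto-sensing of this kind is conceptually simple. Here is a toy sketch
in Python of how an importer might guess a file's format from its
first bytes; the byte signatures shown are placeholders, not the real
vendors' magic numbers (JCAMP-DX files, though, really do begin with
"##TITLE=").

    # Toy format sniffer; signature bytes are placeholders except JCAMP.
    SIGNATURES = {
        b"##TITLE=": "jcamp",   # JCAMP-DX files start with this label
        b"FELIX": "felix",      # placeholder signature
        b"VNMR": "varian",      # placeholder signature
    }

    def sniff_format(path):
        """Guess a format from the first bytes; 'unknown' if no match."""
        with open(path, "rb") as f:
            head = f.read(16)
        for magic, name in SIGNATURES.items():
            if head.startswith(magic):
                return name
        return "unknown"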

>For 2D data, the problems will be a bit more complex....
>
Yup!! I have (and use) Felix, Sybyl/Triad, Felix for Windows, Felix 1.1 for DOS
(beta version), Nuts, and Grams/386 (Galactic, Inc.). Of these, Sybyl/Triad,
Felix (Unix), Nuts (PC), and Felix 1.1 (DOS...never released) all have 2D
capability. I've only been working with Nuts-2D for a week or so, but it's
already got my vote for most bang-for-the-buck for a "PC"-based
package...especially considering its flexible data import capability.

Whatever we (the NMR community) would decide to use, I would think that if we
"picked" a standard software platform around which to build the database, we
should expect some funding to be kicked in by the software vendor in question.
Building/maintaining such a database would be lots of work, and I could see
one of us hiring an extra staff person to take on such a task, which could be
partially/totally funded by one or two software vendors who are interested in
cooperating with such a project...just some rambling thoughts.

Either way, like all such ideas, the key would be to get people to participate
by providing data, and to have someone committed to maintaining the database.
If some such deal could be worked out, where we can get some funding for an
extra person to do the grunt work, I'd be willing to carry the ball.
------------------------------------------------------------------

Now I'm not an NMR person, rather an nmr person, but this sounds like one of
the silliest ideas to come out in a long time. Just because technology allows
us to save masses of data, it doesn't mean that doing it makes sense.
Why save FIDs? Will anyone ever use them, really? I'd like a direct
measurement of the number of x-ray data sets which have been re-analyzed by
persons other than the original owners of the data sets, at least since
1980 when automatic data collection got really into full swing. I think
that the number of such data sets would be near zero. The x-ray community
started this back when computation was really expensive and errors might be
made because of approximate solutions or faulty software. That's pretty
unlikely now.

In any case I think it's silly to archive FIDs. I realize that it's not
your idea but it's easy to complain to you.

Write me as Scrooge.
------------------------------------------------------------------

I am of the opinion that NMR data should be archived in the format
in which it was generated--not converted into any other format, based on
the following:
1) If you download someone else's data to process it, you will be doing it
in one of the offline processing packages.
2) These packages by and large already come with internal data filters or
external conversion programs for all of the existing file formats.
3) If a new standard is defined, then we will be dealing with n+1 formats,
where n is already too large.
4) If one of the formats of an existing offline package is adopted as the
ACS standard, then it would be necessary to convert from one offline format
to the others, something we don't have off-the-shelf programs to do.
I suppose it could be argued that if ACS adopted any format as the
standard, it would put pressure on the spectrometer and software vendors to
switch to it and pave the way for a simpler future (at least in that one
respect). However, it would be a long time before the installed base of
"nonstandard" instruments all died and we could all be in that brave new
world. In sum: the current level of chaos is manageable; the initial
result of any change would be an additional level of complexity.
------------------------------------------------------------------

The only data format easily readable by ALL computers is ASCII text.
I have used the JCAMP format to move data around among PCs.
It seems that this would be a reasonable approach.
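
For illustration, an abbreviated JCAMP-DX record looks like the
following. This is a sketch only: the data values are made up, and a
complete file requires several more labels (##ORIGIN=, ##OWNER=,
##FIRSTX=, ##LASTX=, and so on).

    ##TITLE= example proton spectrum (made-up data)
    ##JCAMP-DX= 5.00
    ##DATA TYPE= NMR SPECTRUM
    ##XUNITS= HZ
    ##YUNITS= ARBITRARY UNITS
    ##NPOINTS= 4
    ##XYDATA= (X++(Y..Y))
    0 102 97 110 95
    ##END=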
------------------------------------------------------------------

Personally, I'd get them to work on just getting their abstracts and
articles-as-published available electronically, first.....

However, I would have some reservations about this whole concept.

Sure, get the articles-as-published available electronically. Download and
plot out (as picture files) published spectra, etc etc, if that will help
your own work.

But actually making the FID available and allowing others to download it,
play around with it, re-process it as they see fit - I see this opening up a
whole minefield of nasty situations.
I could imagine people disputing the work of others based on the second
party's interpretation of the spectra. Two bio-molecular groups working on
the same structure could begin to question each other's assignments and
conformation calculations, as much to further their own cause as to
objectively criticize. Not to mention someone coming along who thinks he
knows it all, but knows little, playing with the data for 5 minutes and
making outlandish accusations. This could include another user erasing or
creating false splittings, etc., by injudicious use of window functions,
or selectively removing or adding peaks by adjusting thresholds.

Maybe that's a worst case scenario.... but... worth considering.

Another more minor point is how much disk space they expect this to take
up. Who is going to want to store every published 3D or 4D data set in its
entirety???
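
(For a rough sense of scale, with made-up but plausible dimensions: a
4D set of 1024 x 128 x 128 x 32 complex points at 4 bytes per word is
about 4 gigabytes of raw data, before any processing.)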

Well, just my 0.02c worth - hope they're food for thought.
---------------------------------------------------------------------------

As a vendor, I am often concerned about data interchange and format
conversion. I was involved in a couple of JCAMP meetings, but JCAMP is
simply inadequate for NMR data. One reasonable approach is the netCDF
format. Alex Macur, formerly of NMRi and presently with Tripos, has
done a vast amount of work on an NMR netCDF specification. NetCDF has
the advantage of both binary and ASCII forms and the ability (or at
least the possibility) to handle real, complex, TPPI, States,
hypercomplex, etc. NMR data. What is lacking is 1) a group to finalize
the NMR netCDF specification and 2) a group to handle validation of
each vendor's netCDF implementation.
The mass spectrometry community generated a netCDF specification a couple
of years ago. The ASMS (American Society for Mass Spectrometry) runs the
specification and validation.
The NMR community does not have an organization comparable to the ASMS, and
I believe this is why a similar exchange format has not developed.
One possibility is for the ACS, ENC, or AMMRL to pick up the pieces
of the preliminary netCDF specification, finish it, and act as the
validating organization. If all vendors (hardware and software) are
involved, then part 2 of your question regarding processing will
probably follow quickly.
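
To make the idea concrete, here is a minimal sketch of storing a 1D
FID in netCDF, written against the netCDF4 Python binding. Every
variable and attribute name below is hypothetical; pinning those
names down is exactly what the unfinished specification would do.

    # Store a complex 1D FID plus its metadata in one self-describing file.
    # All variable/attribute names are hypothetical, not from a real spec.
    import numpy as np
    from netCDF4 import Dataset

    fid = np.zeros(4096, dtype=np.complex128)  # stand-in for acquired data

    ds = Dataset("example_fid.nc", "w")
    ds.createDimension("point", fid.size)
    ds.createDimension("ri", 2)                # real/imaginary pair
    var = ds.createVariable("fid", "f8", ("point", "ri"))
    var[:, 0] = fid.real
    var[:, 1] = fid.imag
    var.sweep_width_hz = 8000.0                # metadata travels with data
    var.observe_frequency_mhz = 500.13
    ds.acquisition_mode = "States"             # vs. TPPI, hypercomplex, ...
    ds.close()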
-------------------------------------------------------------------------

This should be a hot topic with a lot of responses. I would be surprised if you
are not inundated with mail. Good luck!

Since you asked for thoughts, and I have had occasion to think upon this
subject, I supply the following:

1. It is critical to first decide upon the goal of this archival system as
it will drive many of the inevitable decisions and trade-offs. Is the
purpose to accurately display published data on a reader's desktop? Is it
to allow readers to probe for previously hidden tidbits in published data?

Personally speaking, the former should be the goal of desktop/network
publishing. In analogy with x-ray structures and DNA sequences, the atomic
coordinates and nucleotide sequences are the reduced data associated with
the publication and are appropriate. The raw instrumental data is not.

2. The items that you mentioned, i.e., data format and processing software,
are the visible outcome of deeper architectural issues. For example,
the data format depends on the degree of "open-ness" desired and the goals
of the archive. A text format for the data is very open (as it provides a
lowest common denominator for the readers) but is rather space-inefficient
and can be problematic for larger data sets. Binary formats (and there are
many) will be more space-efficient and more arcane, but not necessarily
impossible to understand.
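
(A rough, made-up illustration of the trade-off: a 2D matrix of
2048 x 512 complex points stored as 4-byte binary words is about 8 MB;
the same values printed as ASCII at roughly 12 characters each come to
about 25 MB, three times the size, before any compression.)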

3. The preference for FIDs suggests the desire to archive original data
with an eye toward reprocessing by readers; then again, maybe not.

The NMR community seems almost unique in its desire to utilize the ADC
voltages. Maybe this desire is unavoidable, but it does not seem to
assist in meeting the goal.

4. Will the creation of an archive standard be accomplished more
efficiently ($$$, time, etc.) by a free-market approach or a more directed
program? For a (free-market) example: create a standard, and developers will
begin to create programs that enable users to do what they want with the
archive. Directed examples are more varied, but I think that you get the
idea.

5. Doesn't an IR archive standard exist now? The NMR community might
profit from a review of the process and the current status of the work
that has gone on before.

I truly hope that this helps and wish you great luck in achieving what is
likely to be a great undertaking.
------------------------------------------------------------------

I believe the people who may have thought about this
in detail, and have implemented some of these things,
are at CarbBank (a database for complex carbohydrate
data). The people to contact are:

Dana Smith, Manager
Scott Doubet, Director
Peter Albersheim, Executive Director

tel: (206) 733-7183
fax: (206) 733-7283
email: 76424.1122@compuserve.com
------------------------------------------------------------------

netCDF may be worth exploring because it runs on many platforms
and its self-describing headers could tolerate different formats.

this is from the README:

WHAT IS NETCDF?


NetCDF (network Common Data Form) is an interface for scientific
data access and a freely-distributed software library that
provides an implementation of the interface. It was developed by
Glenn Davis, Russ Rew, and Steve Emmerson at the Unidata Program
Center in Boulder, Colorado. The netCDF library also defines a
machine-independent format for representing scientific data.
Together, the interface, library, and format support the creation,
access, and sharing of scientific data.

netCDF data is:

+ Self-Describing: A netCDF file includes information about the
data it contains.

+ Network-transparent. A netCDF file is represented in a form
that can be accessed by computers with different ways of
storing integers, characters, and floating-point numbers.

+ Direct-access. A small subset of a large dataset may be
accessed efficiently, without first reading through all the
preceding data.

+ Appendable. Data can be appended to a netCDF dataset along
one dimension without copying the dataset or redefining its
structure. The structure of a netCDF dataset can be changed,
though this sometimes causes the dataset to be copied.

+ Sharable. One writer and multiple readers may simultaneously
access the same netCDF file.


HOW DO I GET THE NETCDF SOFTWARE PACKAGE?


Via anonymous FTP from
ftp.unidata.ucar.edu:pub/netcdf/netcdf.tar.Z.
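
(A quick way to see the self-describing part in action: the ncdump
utility distributed with the library prints a file's dimensions,
variables, and attributes as readable text, e.g. "ncdump -h
somefile.nc", without dumping the data itself. The filename here is
just a placeholder.)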
------------------------------------------------------------------