Intended Users: from ordinary folk who want to see the value of nature around them in understandable language, to scientists who want to see a quick summary.
Data:
This is not another pie-in-the-sky scheme that will sit empty waiting for data. It can go online as soon as you create a good database.
FOUR MAIN ENTITIES: plants (tens of thousands), chemical compounds contained in plants (tens of thousands), biochemical target-activities (hundreds), and medicinal-pharmacological uses and benefits (hundreds), all interlinked and rigorously referenced to a million scientific sources. Much of the data is hierarchical (species < genus < family, compound < type, arthritis < skeletal). This volume of data, with its interconnecting join tables, requires optimization so searchers don't fall asleep waiting for results. Biological data is newer and softer than business data, which makes it harder to categorize and makes this project both challenging and interesting. The data sits at the junction of a wide variety of subject areas: biochemistry, chemistry, ethnobotany, botany, toxicology, medicine, pharmacology, etc.
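One possible starting shape for the hierarchy and the referenced join tables, as a minimal sketch only. Every table, column, and function name here is an assumption for illustration, the rows are placeholders rather than real data, and SQLite stands in for the eventual database so the example runs anywhere. Only two of the four entities are shown; target-activities and uses would follow the same pattern.

```python
import sqlite3

# Hypothetical schema: all names below are illustrative assumptions.
SCHEMA = """
CREATE TABLE taxon (            -- plant hierarchy: species < genus < family
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    rank TEXT NOT NULL CHECK (rank IN ('family', 'genus', 'species')),
    parent_id INTEGER REFERENCES taxon(id)
);
CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE source (id INTEGER PRIMARY KEY, citation TEXT NOT NULL);
CREATE TABLE plant_compound (   -- join table: every link must cite a source
    taxon_id INTEGER NOT NULL REFERENCES taxon(id),
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    source_id INTEGER NOT NULL REFERENCES source(id),
    PRIMARY KEY (taxon_id, compound_id, source_id)
);
"""


def build_demo():
    """Create an in-memory database filled with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO taxon VALUES (?, ?, ?, ?)", [
        (1, "Family A", "family", None),
        (2, "Genus A1", "genus", 1),
        (3, "Species A1a", "species", 2),
        (4, "Species A1b", "species", 2),
    ])
    conn.executemany("INSERT INTO compound VALUES (?, ?)",
                     [(1, "compound X"), (2, "compound Y")])
    conn.execute("INSERT INTO source VALUES (1, 'placeholder citation')")
    conn.executemany("INSERT INTO plant_compound VALUES (?, ?, ?)",
                     [(3, 1, 1), (4, 2, 1)])
    return conn


def compounds_in(conn, taxon_name):
    """List compounds recorded for a taxon or anything below it in the hierarchy."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM taxon WHERE name = ?
            UNION ALL
            SELECT t.id FROM taxon t JOIN subtree s ON t.parent_id = s.id
        )
        SELECT DISTINCT c.name
        FROM subtree s
        JOIN plant_compound pc ON pc.taxon_id = s.id
        JOIN compound c ON c.id = pc.compound_id
        ORDER BY c.name
    """, (taxon_name,)).fetchall()
    return [r[0] for r in rows]
```

With this layout, asking for a genus or family returns the compounds of every species beneath it, which is the kind of hierarchical search the data calls for. The recursive query syntax is also accepted by PostgreSQL.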
Project Tasks:
Schema design, normalization & denormalization. The four main entities define the start of the schema, but the intersect tables between them may slow search retrieval, so some denormalizing may be called for. How would you approach that balance?
Parsing already tagged data into tables & fields.
Input design that will include scraped capture of some MEDLINE (pubmed.gov) fields, batch entry (a list of plants with a use or chemical), and some data mining to assist and semi-automate curation.
Output design for speed and simplicity
Testing & synchronizing online with appropriate security
Refactoring
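To make the parsing task concrete, here is a minimal sketch. The real tag format is not described here, so the layout assumed below (one `TAG: value` pair per line, records separated by blank lines) and the tag names PLANT, CHEM, REF, and USE are purely hypothetical, as are the function names and the placeholder values.

```python
from collections import defaultdict

# Hypothetical input format: the tags and layout are assumptions, not the real files.
SAMPLE = """\
PLANT: Species A1a
CHEM: compound X
CHEM: compound Y
REF: placeholder citation 1

PLANT: Species A1b
USE: placeholder use
"""


def parse_records(text):
    """Split blank-line-separated records of 'TAG: value' lines into dicts of lists."""
    records = []
    current = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if one is open.
            if current:
                records.append(dict(current))
                current = defaultdict(list)
            continue
        tag, sep, value = line.partition(":")
        if not sep:
            # Fail loudly so untagged lines get fixed by hand, not silently dropped.
            raise ValueError(f"untagged line: {line!r}")
        current[tag.strip().lower()].append(value.strip())
    if current:
        records.append(dict(current))
    return records


def plant_chem_rows(records):
    """Flatten parsed records into (plant, compound) pairs ready for batch insert."""
    return [(plant, chem)
            for rec in records
            for plant in rec.get("plant", [])
            for chem in rec.get("chem", [])]
```

Keeping every tag as a list allows repeated fields (several compounds per plant), and the flattened pairs map directly onto a plant-compound join table.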
Later:
Cross correlation analytics.
Submitting chemical structures to toxicity prediction, target prediction, and QSAR methods.
Using the library of phytocompounds for virtual screening against the growing number of targets of medicinal interest.
Notes:
Biological data is more ambiguous to categorize, and its fields harder to distinguish, than typical business data, so there is continual wrangling between precision and good-enough.
The complexity of the project requires good communication and a long-term commitment.
Leaning towards PostgreSQL for data integrity and Chemaxon compatibility.
I started this many years ago in text files with enough formatting to facilitate parsing into fields, and never got around to the conversion. My head is so deep into translating the science that it's hard to shift gears into database work.
I previously created herbmed.org, which gives an idea, now 20 years old, of the simple-language summaries that I write.
Since it is an academic project, it might be ideal for a teacher with good judgment and a couple of students who want to learn outside the box and end with an interesting publication.
The intent is to establish input and editing on a local computer with synchronization to the web, and later to open online editing permission to qualified editors (with a review system).
I know just enough db to be dangerous. I've got structural ideas in my head with a bunch of many-to-many relations, but lack the denormalization skills for search-time optimization.
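As one way to picture the denormalization trade-off, the sketch below keeps the many-to-many tables as the source of truth and rebuilds a flat, indexed search table from them on demand. This is only an illustration under stated assumptions: all table, column, and function names are hypothetical, the rows are placeholders, and SQLite stands in for the real database. In PostgreSQL the same idea is usually expressed with `CREATE MATERIALIZED VIEW` and `REFRESH MATERIALIZED VIEW`.

```python
import sqlite3

# Hypothetical normalized tables: all names are illustrative assumptions.
NORMALIZED = """
CREATE TABLE plant (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE compound (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE activity (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE plant_compound (
    plant_id INTEGER NOT NULL REFERENCES plant(id),
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    PRIMARY KEY (plant_id, compound_id)
);
CREATE TABLE compound_activity (
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    activity_id INTEGER NOT NULL REFERENCES activity(id),
    PRIMARY KEY (compound_id, activity_id)
);
"""

# Flat search table: the multi-join is paid once per rebuild, not on every search.
REBUILD = """
DROP TABLE IF EXISTS plant_activity_search;
CREATE TABLE plant_activity_search AS
    SELECT p.name AS plant, a.name AS activity, COUNT(*) AS n_compounds
    FROM plant p
    JOIN plant_compound pc ON pc.plant_id = p.id
    JOIN compound_activity ca ON ca.compound_id = pc.compound_id
    JOIN activity a ON a.id = ca.activity_id
    GROUP BY p.name, a.name;
CREATE INDEX plant_activity_search_activity ON plant_activity_search(activity);
"""


def make_demo():
    """Create an in-memory database with placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(NORMALIZED)
    conn.executemany("INSERT INTO plant VALUES (?, ?)", [(1, "Plant A"), (2, "Plant B")])
    conn.executemany("INSERT INTO compound VALUES (?, ?)",
                     [(1, "compound X"), (2, "compound Y")])
    conn.executemany("INSERT INTO activity VALUES (?, ?)",
                     [(1, "activity 1"), (2, "activity 2")])
    conn.executemany("INSERT INTO plant_compound VALUES (?, ?)", [(1, 1), (1, 2), (2, 2)])
    conn.executemany("INSERT INTO compound_activity VALUES (?, ?)",
                     [(1, 1), (2, 1), (2, 2)])
    return conn


def rebuild_search(conn):
    """Regenerate the denormalized table from the normalized source of truth."""
    conn.executescript(REBUILD)


def plants_with_activity(conn, activity):
    """Single-table lookup: plants linked to an activity through any compound."""
    return conn.execute(
        "SELECT plant, n_compounds FROM plant_activity_search "
        "WHERE activity = ? ORDER BY n_compounds DESC, plant",
        (activity,),
    ).fetchall()
```

The cost of this approach is staleness: the search table is only as fresh as its last rebuild, which may suit a workflow where editing happens locally and the web copy is synchronized in batches.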
--
about me: bearbio - resume -
==
I would like to hear of any interest or questions you have about this project.
Nondisclosure required until publication.