Curating additional attributes archived in the ChIP-Atlas
Setting up the things
code:bash
$ awk -F'\t' '$2 == "hg38" && $3 == "TFs and others" && $5 == "Blood" { for(i=10; i<=NF; ++i) print $i }' experimentList.tab | cut -d "=" -f 1 | sort -u | yllm --api-url http://localhost:11434/v1/chat/completions --model llama2:13b --stdin It seems like you have provided a list of over 200 different terms and phrases related to molecular biology, bioinformatics, and clinical research. These terms are commonly used in the field, and they cover a wide range of topics, including gene expression, genotyping, phenotyping, cell culture, and clinical trials.
Here are some general observations and suggestions based on the list you provided:
1. Many of these terms are highly specialized and may only be used in specific contexts or communities. For example, "chy-seq antibody" and "ectopic expression" are likely to be familiar only to researchers working in certain subfields.
2. Some of the terms you provided are more general and could be applied across multiple areas of research. Examples include "cell culture," "gene expression," "phenotype," and "treatment."
3. There is a lot of overlap between the terms you provided and those used in other fields, such as medicine, biotechnology, and bioengineering. This highlights the interdisciplinary nature of molecular biology and related fields.
4. Some of the terms you provided are more specific to certain organisms or systems. For example, "lymphoma cell line" and "tumor stage" are more specific to cancer research, while "knockdown" and "overexpression" could apply to a wide range of organisms and research areas.
5. There are many different abbreviations and acronyms used in molecular biology, such as "INSDC," "GEO," "KDM5A," and "SF3B1." These can be helpful for brevity and clarity, but they may also be unfamiliar to some readers.
6. Some of the terms you provided are more related to clinical research, such as "patient diagnosis," "treatment," and "clinical trial." These terms highlight the importance of translating basic scientific discoveries into practical applications for human health.
Overall, your list provides a comprehensive overview of many of the key terms and phrases used in molecular biology and related fields. By understanding these terms, researchers can effectively communicate their findings and collaborate with others in the field.
code:bash
$ cat <(echo "Here is the list of biological terms used in the experimental sample annotation. Make a new list of terms based on the list by removing redundancies. The new list should be as long as possible.\n") <(awk -F'\t' '$2 == "hg38" && $3 == "TFs and others" && $5 == "Blood" { for(i=10; i<=NF; ++i) print $i }' experimentList.tab | cut -d "=" -f 1 | sort -u) | l | tee list.txt
Here is the list of biological terms used in the experimental sample annotation, with redundancies removed:
1. Activation
2. Antibody
3. ArrayExpress-CellType
4. Background
5. Barcode
6. Biological Replicate
7. Biosample Type
8. Cancer Type
9. Cell
10. Cell Culture Condition
11. Cell Description
12. Cell Differentiation Protocol
13. Cell Karyotype
14. Cell Line
15. Cell Number
16. Cell Organism
17. Cell Phenotype
18. Cell Population
19. ChIP
20. Clone
21. Condition
22. Construct
23. Control
24. Culture
25. Data Type
26. Fixation Method
27. Gene Fusion
28. Genotype
29. Growth Media
30. Health State
31. Host Strain
32. Immunoprecipitate
33. Infection
34. Knockdown
35. Library Prep Protocol
36. Lymphoma Cell Line
37. Material Type
38. Method
39. Modification to Cells
40. Morphology
41. Multiplex Barcode
42. NGS Seq
43. Overexpression
44. Parent Cell Line
45. Passage
46. Phenotype
47. Protocol
48. Protein Expression
49. Purification Target
50. Reference Genome
51. Sample Comment
52. Sample Description
53. Sample Name
54. Sample Pool
55. Sample Type
56. Sex
57. ShRNA
58. Sgrna
59. Sirna
60. Software Version
61. Source
62. Subject Id
63. Submitter Id
64. System
65. Tag
66. Target
67. Time
68. Time Point
69. Tissue
70. Tissue Type
71. Tissue Origin
72. Tissue Source
73. Transduced Gene
74. Transduced with
75. Transfection
76. Transgene
77. Treatment
78. Treatment Concentration
79. Treatment Condition
80. Treatment Description
81. Treatment Dose
82. Treatment Drug
83. Treatment Durations
84. Treatment Group
85. Treatment Length
86. Treatment Molecule
87. Treatment Protocol
88. Treatment Time
89. Treatment Timepoint
90. TrueSeq Barcode
91. Tumor Classification
92. Tumor Stage
93. Tumor Type
94. Vector
95. Vendor
96. Viral Infection
97. Viral Strain
98. Wt or Kdm5a Ko
99. Young Id
Note that some of the terms may still have slight variations or synonyms, but this list should capture the majority of the biological concepts and terms used in the annotation.
code: bash
$ awk -F'\t' '$1 == "ERX989298" && $2 == "hg38"' experimentList.tab | tr '\t' '\n'
ERX989298
hg38
TFs and others
CTCF
Blood
K-562
Primary Tissue=Blood|Tissue Diagnosis=Leukemia Chronic Myelogenous
18846245,98.2,2.9,49764
Illumina HiSeq 2000 paired end sequencing
ArrayExpress-CellType=K562 myelogenous leukemia
ArrayExpress-Immunoprecipitate=anti-CTCF
ArrayExpress-Phenotype=Normal
ArrayExpress-Species=Homo sapiens
ENA first public=2015-06-09
ENA last update=2018-03-08
ENA-CHECKLIST=ERC000011
External Id=SAMEA3267186
INSDC center alias=SC
INSDC center name=Wellcome Sanger Institute
INSDC first public=2015-06-09T14:27:35Z
INSDC last update=2018-03-08T23:29:06Z
INSDC status=public
Submitter Id=K562_CTCF_1-sc-2232821
common name=human
sample description=CTCF ChIP DNA
sample name=K562_CTCF_1-sc-2232821
code:bash
cat <(echo -e "<s>INST Here is a curated list of experimental sample attribute terms used in biology. Rewrite the given metadata description using the corresponding terms in the list instead of the original sample attributes. Output only the attributes with the values.\n# The list of attributes\n") <(cat list.txt) <(echo -e "\n# Sample description to be annotated\n") <(awk -F'\t' '$1 == "ERX989298" && $2 == "hg38"' experimentList.tab | tr '\t' '\n') <(echo -e "\n# Sample description annotated with the curated terms\n") | yllm --api-url http://localhost:11434/v1/chat/completions --model llama2:13b --stdin | tee annotated.ERX989298.txt code:prompt
$ cat list.txt | awk '{$1=""; print $0 }' | while read term; do echo -e "<s>INST Find an attribute which describes about the term \"${term}\" from the given sample description, written in a key=value form. If you found a corresponding attribtue value, your answer should be like ${term}=value. If the description of ${term} was not found, output "Not found". Only output, no other text.\n\n# Sample description\n\n $(awk -F'\t' '$1 == "ERX989298" && $2 == "hg38" { for(i=9;i<=NF;i++){ print $i } }' experimentList.tab | tr '\t' '\n')/INST" | yllm --api-url http://localhost:11434/v1/chat/completions --model llama2:13b --stdin; done | tee annotated.ERX989298.txt code:prompt
<s>INST Find an attribute which describes about the term "Activation" from the given sample description, written in a key=value form. If you found a corresponding attribtue value, your answer should be like Activation=value. If the description of Activation was not found, output Not found. Only output, no other text. # Sample description
Illumina HiSeq 2000 paired end sequencing
ArrayExpress-CellType=K562 myelogenous leukemia
ArrayExpress-Immunoprecipitate=anti-CTCF
ArrayExpress-Phenotype=Normal
ArrayExpress-Species=Homo sapiens
ENA first public=2015-06-09
ENA last update=2018-03-08
ENA-CHECKLIST=ERC000011
External Id=SAMEA3267186
INSDC center alias=SC
INSDC center name=Wellcome Sanger Institute
INSDC first public=2015-06-09T14:27:35Z
INSDC last update=2018-03-08T23:29:06Z
INSDC status=public
Submitter Id=K562_CTCF_1-sc-2232821
common name=human
sample description=CTCF ChIP DNA
sample name=K562_CTCF_1-sc-2232821/INST should be like this
code:prompt
<s>INST From the given sample description, find a decription about the attribute "Activation". When you find a corresponding attribute value, answer Activation=value. If not found, output NA. Only output, say nothing else. # Sample description
Illumina HiSeq 2000 paired end sequencing
ArrayExpress-CellType=K562 myelogenous leukemia
ArrayExpress-Immunoprecipitate=anti-CTCF
ArrayExpress-Phenotype=Normal
ArrayExpress-Species=Homo sapiens
ENA first public=2015-06-09
ENA last update=2018-03-08
ENA-CHECKLIST=ERC000011
External Id=SAMEA3267186
INSDC center alias=SC
INSDC center name=Wellcome Sanger Institute
INSDC first public=2015-06-09T14:27:35Z
INSDC last update=2018-03-08T23:29:06Z
INSDC status=public
Submitter Id=K562_CTCF_1-sc-2232821
common name=human
sample description=CTCF ChIP DNA
sample name=K562_CTCF_1-sc-2232821/INST code:bash
$ cat list.txt | awk '{$1=""; print $0}' | while read term; do echo -e "<s>INST From the given sample description, find a decription about the attribute \"${term}\". When you find a corresponding attribute value, answer ${term}=value. If not found, the value should be NA. Only output, say nothing else.\n\n# Sample description\n\n $(awk -F'\t' '$1 == "ERX989298" && $2 == "hg38" { for(i=9;i<=NF;i++){ print $i } }' experimentList.tab | tr '\t' '\n')/INST" | yllm --api-url http://localhost:11434/v1/chat/completions --model llama2:13b --stdin; done | tee annotated.ERX989298.txt code:txt
Activation=NA
Antibody=anti-CTCF
ArrayExpress-CellType=K562 myelogenous leukemia
Background=NA
Barcode=NA
Biological Replicate = NA
Biosample Type=ChIP DNA
Cancer Type=myelogenous leukemia
Cell = NA
Cell Culture Condition = NA
Cell Description=NA
Cell Differentiation Protocol=NA
Cell Karyotype=NA
Cell Line=K562 myelogenous leukemia
Cell Number=NA
Cell Organism = human
Cell Phenotype=NA
Cell Population = NA
ChIP = NA
Clone=NA
Condition=NA
Construct = NA
Control = NA
Culture=NA
Data Type=NA
Fixation Method=NA
Gene Fusion = NA
Genotype=NA
Growth Media=NA
Health State=NA
Host Strain = NA
Immunoprecipitate = anti-CTCF
Infection = NA
Knockdown=NA
Library Prep Protocol=NA
Lymphoma Cell Line=NA
Material Type = NA
Method=NA
Modification to Cells = NA
Morphology=NA
Multiplex Barcode=NA
NGS Seq = NA
Overexpression=NA
Parent Cell Line=NA
Passage = NA
Phenotype=Normal
Protocol=NA
Protein Expression=NA
Purification Target = NA
Reference Genome=NA
Sample Comment=NA
Sample Description = CTCF ChIP DNA
Sample Name = K562_CTCF_1-sc-2232821
Sample Pool: NA
Sample Type=paired end sequencing
Sex=NA
ShRNA=NA
Sgrna=NA
Sirna=NA
Software Version=NA
Source=NA
Subject Id=SAMEA3267186
Submitter Id=K562_CTCF_1-sc-2232821
System = NA
Tag=NA
Target = NA
Time=NA
Time Point = NA
Tissue=NA
Tissue Type=NA
Tissue Origin=NA
Tissue Source=NA
Transduced Gene=NA
Transduced with=NA
Transfection=NA
Transgene=NA
Treatment=NA
Treatment Concentration = NA
Treatment Condition=NA
Treatment Description = NA
Treatment Dose=NA
Treatment Drug=NA
Treatment Durations=NA
Treatment Group=NA
Treatment Length=NA
Treatment Molecule=anti-CTCF
Treatment Protocol=NA
Treatment Time=NA
Treatment Timepoint=NA
TrueSeq Barcode=NA
Tumor Classification=NA
Tumor Stage=NA
Tumor Type=NA
Vector=NA
Vendor=NA
Viral Infection=NA
Viral Strain=NA
Wt or Kdm5a Ko=NA
Young Id=NA