If you want to do research involving anything even remotely biology data science-y, I HIGHLY recommend getting at least a basic foundation in the command line (so you can download packages, access servers, etc.) and Python (a programming language used in a lot of computational biology, so you can write your own code or edit code that others share on GitHub). Also, there's a tool I used a little in undergrad and a little in grad school but never really got deep into, and now wish I had: R. It's used a lot for statistics and for plotting really nice graphs, charts, etc. Sooo much better than Excel, and free, unlike Prism (which is also less customizable). Customizability is a huge reason to learn some coding.
And you don’t have to start from scratch. People who really know what they’re doing post their code on GitHub, and there’s soooooo much free stuff there. It’s pretty wild how you can just go “shopping” for scripts, but it’s all free! And easy to download. A lot of times people will post the custom code they used for their publications, and then you can customize it to analyze your own data or whatever you want.
But there is a bit of a learning curve. Or curves, plural, really, because each programming language (e.g. R, Python, bash) uses its own syntax, and there’s also the computer side of things: figuring out servers and all that stuff. So my biggest advice is to start early and practice frequently so you don’t forget it all. The exact tools you need to know will vary, but a solid foundation in the basics will get you far (with a little help from your friends and a LOT of help from Google!).
A few other things that are worth getting good at (or at least being familiar with):
- When you work in R you often use RStudio, which is a nice “IDE” (integrated development environment) for working with R. Basically, you still write code, but inside a GUI (graphical user interface), so you can see your variables and tables and figures and stuff
- also with R you can find lots of general-purpose packages on CRAN and lots of biology-focused packages on Bioconductor
- For working in Python, Jupyter Notebooks are really helpful. These are web-browser based, but your computer is doing the work. You can also run one on a remote server (great if you’re dealing with big and/or computationally heavy files) and then copy and paste the link it gives you into a browser on your local computer to view it (depending on how your server is set up, you may need some extra steps, such as SSH port forwarding, for that link to work)
- Anaconda – this lets you create virtual environments where you can install all the “right” versions of all the programs, packages, etc. you need to run your scripts. This way, you can have multiple versions on your computer and make sure you use the one the program was written to use (updates and syntax changes can cause problems if a script is expecting one version but you give it another). If you Google something like “conda install thing” you can usually find a line of code you can copy and paste into your terminal to install it into your conda environment. When you get Python through Anaconda, it also comes with a lot of the common software packages you’ll want in biology
- XQuartz – on a Mac, this lets you view graphical interfaces (X11 windows) from programs running on a remote server
- Galaxy – this has a bunch of web-based tools you can use to analyze genomics data
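Since version mismatches are the whole reason for those environments, here is a minimal Python sketch for checking what the environment you're currently in actually has installed. The helper name `installed_version` and the example package names are just my illustrations, so swap in whatever your script needs.

```python
import sys
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Which Python is this environment running?
print("Python", sys.version.split()[0])

# "numpy" and "pandas" are only example package names (an assumption for this sketch).
for name in ["numpy", "pandas"]:
    version = installed_version(name)
    print(name, version if version is not None else "not installed")
```

Running this inside two different environments is a quick way to see that they really do hold different versions.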
Get used to Googling! A lot of Googling! This is true for all of science, but especially coding stuff. A quick Google search will often find you what you’re looking for.
Learn the quick codes for getting help with commands, functions, etc.
- in R, use help(thing_you_want_help_with) or ? thing_you_want_help_with
- in Python, use help(thing_you_want_help_with)
- in Terminal, use man thing_you_want_help_with
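In Python, `help()` mostly shows you the “docstring” someone wrote at the top of a function, which also means you can make your own functions help-able. Here's a small sketch; `gc_content` is a made-up example function, not something from a library.

```python
import inspect


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (0.0 if it is empty)."""
    if not sequence:
        return 0.0
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# In an interactive session you would just type: help(gc_content)
# inspect.getdoc returns the docstring text that help() displays.
print(inspect.getdoc(gc_content))

# It works the same way for built-in functions such as len.
print(inspect.getdoc(len))
```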
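Don’t put spaces in your file names. Use underscores instead. (The command line treats a space as the gap between two separate things, so a name with spaces has to be quoted or escaped every time you type it.)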
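If you already have a folder full of space-y file names, something like this Python sketch can clean them up. The function names are my own, and it defaults to a “dry run” that only reports what it would rename, so you can check before changing anything.

```python
from pathlib import Path


def underscore_name(name):
    """Replace each run of whitespace in a name with a single underscore."""
    return "_".join(name.split())


def rename_without_spaces(folder, dry_run=True):
    """Rename everything directly inside `folder` so names contain no spaces.

    With dry_run=True (the default) nothing is renamed; the planned
    (old_name, new_name) pairs are only returned so you can check them first.
    """
    changes = []
    for path in sorted(Path(folder).iterdir()):
        new_name = underscore_name(path.name)
        if new_name == path.name:
            continue  # nothing to fix
        target = path.with_name(new_name)
        if target.exists():
            continue  # never overwrite an existing file; leave it for a human
        changes.append((path.name, new_name))
        if not dry_run:
            path.rename(target)
    return changes


print(underscore_name("western blot 2021.csv"))
```

You'd call `rename_without_spaces("some_folder")` to preview, then again with `dry_run=False` once the list looks right.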
Where applicable, talk to your IT department or lab mates or whoever can help to learn how to connect to your institution’s remote server so you can run process-heavy stuff. You typically will need to use something like “ssh myname@servername” (with your own username and your server’s name). Also learn how to move things to and from the server, e.g. with “scp”. Technical note: if you want to move folders (aka directories), include -r for recursive.
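If you end up copying things to the server a lot, you can have Python put the scp command together for you. This is only a sketch: `build_scp_command` is a name I made up, and `myname@servername:project/` is a placeholder for your real login and folder.

```python
import shlex


def build_scp_command(source, destination, is_directory=False):
    """Build an scp command as a list of arguments, adding -r for directories."""
    command = ["scp"]
    if is_directory:
        command.append("-r")  # recursive: copy the folder and everything in it
    command.extend([source, destination])
    return command


# Placeholder names: "results" is a local folder, the rest is your server login.
cmd = build_scp_command("results", "myname@servername:project/", is_directory=True)

# Show the command exactly as you would type it in the terminal.
print(shlex.join(cmd))
```

To actually run it from Python rather than just print it, you could pass the list to `subprocess.run(cmd, check=True)`.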
A few really key bash commands are:
- pwd (Print Working Directory) tells you where the terminal is currently running from, which dictates what files it has access to
- cd lets you Change Directory if you want to move to a different one
- cd .. takes you up a level in directories
- ls lists what’s in a directory
- rm deletes something (permanently! There’s no trash can to fish it back out of, so double-check before you hit enter)
- mkdir makes a new directory (folder)
- use up arrow to go up to previous commands you’ve typed in
- use tab to autofill file names, etc.
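Once those commands feel familiar, it can help to see that Python has close equivalents for when you want to do the same things inside a script. Here's a rough sketch using the standard library; the function name `tour_of_commands` is mine, and it works inside a throwaway temporary folder so nothing real gets touched.

```python
import os
import tempfile
from pathlib import Path


def tour_of_commands():
    """Mimic pwd, cd, mkdir, ls and rm inside a temporary scratch folder."""
    start = Path.cwd()                               # like pwd
    with tempfile.TemporaryDirectory() as scratch:
        try:
            os.chdir(scratch)                        # like cd
            Path("new_folder").mkdir()               # like mkdir new_folder
            Path("notes.txt").write_text("hello\n")  # make a small file to play with
            listing = sorted(p.name for p in Path.cwd().iterdir())  # like ls
            Path("notes.txt").unlink()               # like rm notes.txt
            Path("new_folder").rmdir()               # like rmdir (empty folders only)
        finally:
            os.chdir(start)                          # go back to where we started
    return listing


print(tour_of_commands())
```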
A great tutorial on Python, Jupyter Notebooks and more from my friend and grad school colleague Dr. Shaina Lu. URP 2021 Programming Course: https://github.com/shainalu/URP_2021_Programming_Course
Here’s a whole free course on R from UC Davis: R-DAVIS (R–Data Analysis & Visualization In Science) https://gge-ucd.github.io/R-DAVIS/index.html
Some good YouTube videos:
- An introduction to the R programming language for Bioinformatics students – Part 1/2 & 2/2, from Stephen Guest, Ph.D., University of Michigan Computational Medicine and Bioinformatics: https://youtu.be/bekFrlW0gww & https://youtu.be/LaFZUad6zXQ
- Jupyter Notebook Tutorial from Project Data Science: https://youtu.be/DKiI6NfSIe8