CS173: Intro to Computer Science - DNA Mutations (100 Points)

Assignment Goals

The goals of this assignment are:
  1. To manipulate strings using the substring method
  2. To apply string indexing in the context of iteration
  3. To select and implement appropriate test cases to check boundary cases and potential off-by-one errors
  4. To memory map strings on paper and to validate that mapping using the step debugger

Background Reading and References

Please refer to the following readings and examples offering templates to help get you started:

The Assignment

Nucleic Acid (NA) Sequences like DNA and RNA are organic chains of phosphates and sugars that encode that encode genetic information about living things. These chains mutate in various ways through reproduction and as a result of cellular damage (for example, through UV from sunlight). We represent the four DNA nucleotide bases as (A)denine, (C)ytosine, (G)uanine, and (T)hymine, using the letters A, C, G, and T, respectively. An example DNA chain might be represented as AATCC. Here is an example RNA chain from Wikipedia (which uses additional nucleotide bases such as (U)racil, represented by the letter U).

An RNA Chain from Wikipedia

In this assignment we will restrict ourselves to DNA bases A, C, G, and T. Each of these bases has a complement, as follows:

  • The complement of A is T (and vice-versa)
  • The complement of C is G (and vice-versa)

A substring of DNA bases (for example, ACTG) might appear later in the chain in the form of its complement, in reverse. That is, the complement of ACTG is TGAC (the A became a T, the C became a G, the T became an A, and the C became a G). In this example, ACTG is referred to as a sense, and its reversed complement is called an antisense. When we reverse TGAC, we obtain CAGT, and this is the antisense of the sense ACTG. A DNA chain with this sense/antisense pattern might be represented as ACTG…CAGT (with zero or more additional bases occurring in the middle).

Part 1: Detecting Differences in NA Chains

Given two strings, compute their percentage difference, character by character. If the lengths are different, the difference of their lengths count as differences as well. For example, “ACCG” and “ACTG” would differ by 25%, because 1 character is different out of the 4. “ACCG” and “ACCGT” would differ by 20%, because 1 character is different out of a possible 5 (the length of the largest string). This can be represented as a double (for example, 0.25 for 25%, and 0.2 for 20%).

Part 2: Inserting a Chain

Given an NA chain string, an NA subchain, and a position, insert the subchain into the chain. For example, insert("ACCG", "TT", 2) would return “ACTTCG” (recall that the indices start at 0, so the TT occupies position 3 and 4 in the string, which are indices 2 and 3.

Part 3: Detecting an Antisense

Next, write a function to search one NA chain for the existence of an antisense. As a parameter, pass the sense chain whose antisense you are searching for. Return a boolean if you find it. This function will need to call two other functions - one to compute the complement of the sense (the “antisense”), and next function to reverse that antisense chain string. If you do those two things first, detecting the antisense becomes a simpler matter of searching one string for another.

An Example

For example, public static boolean detectAntisense(String chain, String sense) should compute the complement of sense, reverse the variable sense, and then search for sense in chain. As a specific example,

detectAntisense("ACATGCTATGTA", "ACAT");

should compute the complement of the sense ACAT (which is TGTA), and then reverse it to obtain the antisense (which is ATGT). Finally, return true if the antisense ATGT is found in the chain (which is ACATGCTATGTA) – and in this case, it is (so we return true)!

You can use the string indexOf() method to search for one String inside another. If indexOf returns -1, meaning you did not find the antisense in the chain, return false. Otherwise, return true.

Looping over a String

To loop over every character of a string, you can loop as follows:

for(int i = 0; i < str.length(); i++) {
    System.out.println(str.charAt(i)); // as an example, this prints every character in the string!
}

Or, to loop backwards (for example, to reverse a String), you can try this:

String reversed = "";
for(int i = str.length() - 1; i >= 0; i--) { // why str.length() - 1?
    reversed = reversed + str.charAt(i); // append to the new String, one character at a time
                                         // from the end to the beginning of the original String
}

When manipulating string indices, it is very important to avoid off-by-one errors. For example, the substring(start, end) method starts at the index start but ends at the index end - 1, which can be confusing. Compounding the confusion is that string indices start counting at 0, like most arrays do. From the javadoc documentation for substring, we see that the substring(1, 5) of “smiles” is “mile”.

If you are looping through your string, take care to stop searching not only prior to the end of the string (as you normally would), but at the point at which the substring you are searching for would run past the end of the string. For example, if you are searching for “ACC” in “CGCTG”, you could stop searching once you reach the “T”, because there is no room for “ACC” to fit there. Your loop terminating condition will be a value less than chain.length(), where chain is the sequence you’re searching within (i.e., “CGCTG”) - what should it be instead (hint: it is related to subchain.length(), where subchain is the sequence you’re searching for, i.e. “ACC”)?

Prior to writing your code, draw a grid representing the string you’re searching for, and number the indices from 0 to the length of the string. Then draw a grid representing the chain you’re searching within (again, from 0 to the length of that chain). Step through the search procedure on paper so that you can see the indices that you’ll be working with. How do your indices relate to the lengths of the source and target subchains? This will provide you the answers you need to implement your algorithm!

Part 3, Step 1: Computing the Complement of an NA Sense

Given an NA chain string, return a new string representing its complement. This should be a string of the same length as the original; however, each character should be replaced with its complement. That is, all the A’s should be switched to T’s (and vice-versa), and all the C’s should become G’s (and vice-versa). So, the sense ACAT becomes TGTA.

Question: why do you need to return a new string? Since strings are represented as arrays of characters, why can’t you manipulate the input string paramter directly? Even if you could, why do you think it is a good idea to create a new string anyway?

Part 3, Step 2: Reversing the NA Sense

Given an NA chain string, return a new string of the same length but with all the characters reversed. That is, “ATCG” becomes “GCTA”. You will reverse the complement of the sense that you computed in Step 1. So, in our example, TGTA (the complement of the sense ACAT) becomes ATGT.

Part 3, Step 3: Compute an Antisense

Using the two functions you just wrote to compute the complement and the reverse of a chain, write a function to compute the antisense of a given NA chain. To do this, compute the complement of the chain, and then reverse it. To do this, simply call the function you wrote for Step 1 to compute the complement, passing it the sense (not the original chain, but the sense you’re searching for; for example, if your primary function is detectAntisense("ACATGCTATGTA", "ACAT");, you would compute this on ACAT); then, call your Step 2 function to reverse the result.

Part 3, Step 4: Find the Antisense

Finally, determine if the antisense is located inside the chain. So, in our example, you would search for ATGT in ACATGCTATGTA, because the sense you are seeking is ACAT. You can use the chain.indexOf() to help you here - feel free to look it up to see how it works!

Part 4: Removing a Chain

Given an NA chain string and an NA subchain, remove all instances of the subchain from the chain. For example, remove("ACCGCC", "CC") would return “AG”.

Note that Java Strings now have a method replaceAll that will do this for you. I definitely encourage you to use this. However, you may notice that these helper methods don’t always exist across many of the string operations we’re exploring here. So, there is significant value in practicing with string indexing. For full credit on this problem, implement your own replacement algorithm to accomplish this without calling a replace or replaceAll string method. However, it would be a good idea to write a unit test that compares your results to a call to replaceAll, and you should feel both free and encouraged to do so!

So, to do this, you will need to read the String one character at a time, and determine whether the substring is equal to your subsequence. Append the character to your result String if they don’t match, and advance your loop counter if they do match (so that you skip those characters). Notice that I’m not asking you to remove the characters directly from the original String! Although you could do this, you will have to do some extra work to update your loop counter if you make a change to the String.

Question: for the sequence ACCGCC, replacing the subsequence CC, what pairs of characters do you need to compare to CC? For example, you would first check AC, but then what are the rest, and what are their substring begin and end indices? List out each pair of characters, and their indices. These should give you a hint about how your loop will work.

Part 5: Testing

Write unit tests for each part of this assignment. You will need more than one test case per part. Your test cases should include boundary conditions (for example, inserting or removing to the beginning, middle, and end of a string). Your goal is to uncover errors with your test cases. Because of 0-indexing, and off-by-one errors, an algorithm can appear to work fine as long as you manipulate the middle of a chain, but then break when you are dealing with the beginning or end. Even I made an error while completing this assignment that was only uncovered when I tried to execute it at the end of a chain. These mistakes are very easy to make, and you should assume that your code contains these bugs. Think of testing like a game: your goal is to cause your software to break (that’s how we identify bugs to fix and make our software more robust!).

Question: What test cases would you write in order to try to do that?

Exporting your Project for Submission

When you’re done, write a README for your project, and save all your files, before exporting your project to ZIP. In your README, answer any bolded questions presented on this page. Here is a video tutorial describing how to write a README for your project, and how to export it.

Design Questions to Help You Begin

Please answer the following questions in your README file before you begin writing your program.
  1. Given a String "Computing", what beginning and ending indices would you pass to substring to retrieve the letters "put"?
  2. Suppose you have a String x = "CS" and a String y = "173". How would you create a String z that combines the two strings to be CS 173 without re-typing CS or 173?
  3. Suppose you have a String x = "hamburger", and you wish to change it to "cheeseburger". What calls to x.substring() would allow you to do this?

Submission

In your submission, please include answers to any questions asked on the assignment page in your README file. If you wrote code as part of this assignment, please describe your design, approach, and implementation in your README file as well. Finally, include answers to the following questions:
  • Describe what you did, how you did it, what challenges you encountered, and how you solved them.
  • Please answer any questions found throughout the narrative of this assignment.
  • If collaboration with a buddy was permitted, did you work with a buddy on this assignment? If so, who? If not, do you certify that this submission represents your own original work?
  • Please identify any and all portions of your submission that were not originally written by you (for example, code originally written by your buddy, or anything taken or adapted from a non-classroom resource). It is always OK to use your textbook and instructor notes; however, you are certifying that any portions not designated as coming from an outside person or source are your own original work.
  • Approximately how many hours it took you to finish this assignment (I will not judge you for this at all...I am simply using it to gauge if the assignments are too easy or hard)?
  • Your overall impression of the assignment. Did you love it, hate it, or were you neutral? One word answers are fine, but if you have any suggestions for the future let me know.
  • Any other concerns that you have. For instance, if you have a bug that you were unable to solve but you made progress, write that here. The more you articulate the problem the more partial credit you will receive (it is fine to leave this blank).

Assignment Rubric

Description Pre-Emerging (< 50%) Beginning (50%) Progressing (85%) Proficient (100%)
Algorithm Implementation (40%) The algorithm fails on the test inputs due to major issues, or the program fails to compile and/or run The algorithm fails on the test inputs due to one or more minor issues The algorithm is implemented to solve the problem correctly according to given test inputs, but would fail if executed in a general case due to a minor issue or omission in the algorithm design or implementation A reasonable algorithm is implemented to solve the problem which correctly solves the problem according to the given test inputs, and would be reasonably expected to solve the problem in the general case
Test Cases (20%) Testing was performed outside of the unit test framework, or not performed at all Trivial test cases are provided in a unit test framework Test cases that cover some, but not all, boundary cases and branches of the program are provided Test cases that cover all boundary cases and branches of the program are provided
Code Quality and Documentation (30%) Code commenting and structure are absent, or code structure departs significantly from best practice, and/or the code departs significantly from the style guide Code commenting and structure is limited in ways that reduce the readability of the program, and/or there are minor departures from the style guide Code documentation is present that re-states the explicit code definitions, and/or code is written that mostly adheres to the style guide Code is documented at non-trivial points in a manner that enhances the readability of the program, and code is written according to the style guide, and each function contains relevant and appropriate Javadoc documentation
Writeup and Submission (10%) An incomplete submission is provided The program is submitted, but not according to the directions in one or more ways (for example, because it is lacking a readme writeup or missing answers to written questions) The program is submitted according to the directions with a minor omission or correction needed, including a readme writeup describing the solution and answering nearly all questions posed in the instructions The program is submitted according to the directions, including a readme writeup describing the solution and answering all questions posed in the instructions

Please refer to the Style Guide for code quality examples and guidelines.