Goldentime

Saturday, February 17, 2018

[codeFights] almostIncreasingSequence

Given a sequence of integers as an array, determine whether it is possible to obtain a strictly increasing sequence by removing no more than one element from the array.

Example

For sequence = [1, 3, 2, 1], the output should be
almostIncreasingSequence(sequence) = false;
There is no one element in this array that can be removed in order to get a strictly increasing sequence.
For sequence = [1, 3, 2], the output should be
almostIncreasingSequence(sequence) = true.
You can remove 3 from the array to get the strictly increasing sequence [1, 2]. Alternately, you can remove 2 to get the strictly increasing sequence [1, 3].

Input/Output

[execution time limit] 3 seconds (cs)
[input] array.integer sequence
Guaranteed constraints:
2 ≤ sequence.length ≤ 10⁵,
-10⁵ ≤ sequence[i] ≤ 10⁵.
[output] boolean
Return true if it is possible to remove one element from the array in order to get a strictly increasing sequence, otherwise return false.

So, my first thought was to calculate the longest increasing sequence for `sequence`, but it exceeded the time limit - of course, if the longest input can be 10^5, then O(n^2) algorithm should not work. One intuition is that longest increasing sequence calculates more things than we actually need. Here we can only have one element out of order, and thus decrease the difficulty of calculation.

Another note is that I start to use C# instead of python, for working purpose. My company mainly uses C#, and I want to learn.

bool almostIncreasingSequence(int[] sequence) {
    
    int leftMax = sequence[0];
    int leftMaxIndex = 0;
    int removed = 0;
    
    for(int i=1; i<sequence.Length; i++)
    {
        if(sequence[i]<=leftMax)
        {
            //need to remove leftMax
            if(removed >= 1)
                return false;
            //no items has been removed before
            if(leftMaxIndex > 0)
            {
                if (sequence[i] > sequence[leftMaxIndex-1])
                {
                    leftMax = sequence[i];
                    leftMaxIndex = i;
                }
            }
            else
            {
                leftMaxIndex = i;
                leftMax = sequence[i];
            }
            removed += 1;
        } 
        else
        {
            leftMaxIndex = i;
            leftMax = sequence[i];
        }
    }
    return true;
}

Tuesday, December 19, 2017

[Book Exercise] Regression Methods in Biostatistics

Chapter 2 exercise:

https://github.com/xinyutan/book_regression/blob/master/Ch2_EDM.ipynb

Chapter 3 exercise:

https://github.com/xinyutan/book_regression/blob/master/Ch3_exercise.ipynb

Thoughts: I feel low efficiency when reading by myself. I feel myself learning a lot better when I am in a class. Why is that? 1. I think listening to something is a better way for me to absorb information 2. It's better to have constant feedback and assurance. However, I must learn to learn by reading myself.

Chapter 4:

Notes: https://github.com/xinyutan/book_regression/blob/master/Notes_Chapter4LinearRegression.ipynb

Notes about different relationships between predictors.

Exercise: https://github.com/xinyutan/book_regression/blob/master/Ch4_exercise.ipynb

Wednesday, December 6, 2017

Questions regarding to risk stratification

As I finished reading the book Risk Stratification - A practical guide for clinicians, I get so confused by the "goal" of risk stratification. I thought it would be easy for me and I more or less understand, but the clinical concept is so confusing!

I will just take notes of some questions here. Hopefully, I will understand them later.

In Page 112, it says that only when the epidemiology problem is worked out, we should consider risk stratification. Then, what is the population-based epidemiology? What does it try to solve?

Okay, I found a dataset that I might be able to use for practice: http://archive.ics.uci.edu/ml/datasets/Thoracic+Surgery+Data

Friday, December 1, 2017

First view of risk stratification

Risk stratification is widely used in medicine.

At the first sight, risk stratification is nothing more than to learn the coefficients of linear/logistic regression. But if it's this simple, then why do medical professionals spend so much effort elaborate it? What's the difference? What are the things that we overlook in machine learning but need to pay special attention in medical context?

I am reading Risk Stratification - A practical guide for clinicians to find out.

==

By reading some paragraphs into it, the goals for two studies are very different. For machine learning classification, our goal is to predict as accurate as possible. So we use whatever methods that works the best. Normally, logistic regression is not so great. However, here, our goal is more focused on the "variables"; we want to know exactly how much this particular variable contributes to the outcome - linear model has an advantage in its clarity.

==

Regarding to data collection, be careful of harvesting effects where subtle effects that remove people from risk before they come to attention. We should always ask whether the data collection is complete , does it exclude certain population that we overlook (e.g., die before/during the intervention)?

==

About the difference between risk stratifications and clinical research study: the goal and the data are all different. For risk stratification, we want to know what are the other risk factors that cause the treatment result different. Therefore, the data will only contain the people with treatment (??? can't we make treatment a variable as well???). For clinical research study, however, we specifically want to know the effect of the treatment. Therefore, we want to keep all other factors the same, at least similar.

It seems that risk stratification is more on "understanding", clinical research studies more on "testing a well-defined hypothesis".

Several type of experiment design:

randomized trial (strict control)
cohort study (no manual control, only observe)
case-control (focus on positive cases. Due to the lack of whole population with a certain exposure/treatment, then not suitable for risk calculation.)

==

Rate standardization:
One good example is the mortality rate due to cancer. If we do not account the demographics change, we would conclude that the mortality rate increased 50% from 1940 to 1980. However, we also know that in 1980, there are a lot more senior population. If we take age difference into account, we would find that the mortality rate only increases 10%. This example shows a very thoughtful concern in medical problem.

Steps after calculating the Observed/Evaluated ratio (need to read more on statistics):

(chi-square for discrete data) statistical test
confidence interval
If the statistical test shows non-significant, check statistical power.

==

Some common concerns for machine learning task as well:
1. Need to consider if the training data and testing data follow the same distribution (features and labels).
2. Selection bias - is the treatment population selected non-randomly?

Sunday, September 3, 2017

[Leetcode] 388. Longest Absolute File Path

Suppose we abstract our file system by a string in the following manner:

The string "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext" represents:

dir
    subdir1
    subdir2
        file.ext

The directory dir contains an empty sub-directory subdir1 and a sub-directory subdir2 containing a file file.ext.

The string "dir\n\tsubdir1\n\t\tfile1.ext\n\t\tsubsubdir1\n\tsubdir2\n\t\tsubsubdir2\n\t\t\tfile2.ext" represents:

dir
    subdir1
        file1.ext
        subsubdir1
    subdir2
        subsubdir2
            file2.ext

The directory dir contains two sub-directories subdir1 and subdir2. subdir1 contains a file file1.ext and an empty second-level sub-directory subsubdir1. subdir2 contains a second-level sub-directory subsubdir2 containing a file file2.ext.

We are interested in finding the longest (number of characters) absolute path to a file within our file system. For example, in the second example above, the longest absolute path is "dir/subdir2/subsubdir2/file2.ext", and its length is 32 (not including the double quotes).

Given a string representing the file system in the above format, return the length of the longest absolute path to file in the abstracted file system. If there is no file in the system, return 0.

Note:

The name of a file contains at least a . and an extension.
The name of a directory or sub-directory will not contain a ..

Time complexity required: O(n) where n is the size of the input string.

Notice that a/aa/aaa/file1.txt is not the longest file path, if there is another path aaaaaaaaaaaaaaaaaaaaa/sth.png.

I solved this problem about 7 months ago - when I first started to tackle these Leetcode problems. Back then, I wasn't proficient in any algorithms. But somehow, I managed to solve this problem with an extremely messy code. Ever since I got that job offer, I thought I should re-do some of the problems or focus those problems that require some modeling (i.e., Google problems).

This problem is on the easier side of this kind. It is very obvious files and folders are organized in tree structure. Only if we have a well-organized tree structure, then we can basically use backtracking to find out the answers. However, the input is a string. But, by looking at the string, we can see that the number of '\t' in the element is the same as the level of that elements. For example, '\tsubdir1', '\t\tfile.txt' are in the first, second level of the file organization, respectively.

We can't easily do a tree traverse on this type of string input, but hey, the elements order after input.split('\n') is similar to the in-order traversal of the directory tree (here might be a forest).

Therefore, the abstraction of the problem will be:

Given the in-order traversal of a tree and each element's level, reconstruct the maximum length path from root to leaf where we only considers those leaves that contain '.' (file).

Well, then we just use a stack, when the level is higher than the current level, meaning that the incoming element will be the children of current element; when lower, meaning that the incoming one is the parent; the same, siblings.

class Solution(object):
    def lengthLongestPath(self, input):
        """
        :type input: str
        :rtype: int
        """
        stack = []
        current_level = 0
        res = 0
        for name in input.split('\n'):
            #print stack, current_level
            tabs = name.split('\t')
            if len(tabs) - 1 == current_level:
                if stack:
                    stack.pop()
                stack.append(tabs[-1])
            elif len(tabs) -1 > current_level:
                stack.append(tabs[-1])
            else:
                for _ in range(current_level - len(tabs) + 2):
                    if stack: stack.pop()
                stack.append(tabs[-1])
            if '.' in tabs[-1]:
                res = max(res, len('/'.join(stack)))
            current_level = len(tabs) - 1
        return res

[Update] The end of a period

I got a job offer in a small company in town - it matches my expectation in almost every aspect: meaningful domain, the opportunity to learn and improve, and nice pay.

The interview did not use any of the stuff I have prepared for the last few month - days and nights going through, sometimes painfully, all these Leetcode problems. But I am glad I did it. By doing these algorithmic problems, I have gained a much deeper understanding of common algorithms, been more appreciative of good and efficient code, and cultivated a habit to endure some hard and lonely moments in life.

My progress in Leetcode - I actually never have imagined that I could have done so many problems.

I like doing these problems, so I will likely keep updating some interesting problems I have solved, but to a less intensity. This might be even better, as now I am in a much more relaxing mood - I might be able to think deeper than just finishing as much problems as possible.

I look forward to this new chapter of my life

Sunday, August 27, 2017

[Leetcode] 334. Increasing Triplet Subsequence

This problem reminds me of the fancy version of longest increasing subsequence problem ($O(n\log n)$ solution).

Let's look at an example array: 1, 2, -1, 1, -2, 3. When we go through the first two element, we are all good. Then, we encounter -1, here we face a decision, should we update the current minimum to -1?

Let's see why we should and why we shouldn't. -1 is a safer move than 1, what if later our solution is -1, 0, 1? Then we definitely want to update it. However, you might say, if we update it to -1, then 2 shouldn't be in our solution anymore, according to the "subsequence" rule.

Well, the answer is YES, but we only update the minimum, keeping the second minimum (which is 2 here).The argument is as following: when we keep the second minimum (call it min2), this means that we only count increasing when the next value is greater than min2. You see, by changing min1 to a smaller value, it doesn't really affect the decision we are going to make about adding an increasing element. However, if we see the next value is smaller than min2 but greater than min1, we then want to update min2.

In this way, we always keep track the safest possible value yet maintain the order we have for now.

In actual implementation, I used a stack instead of min1, min2. It turned out to be much easier to handle. Well, who knows such a seeming simple question took me good two hours to think through - this kind of question is very tricky.

Given an unsorted array return whether an increasing subsequence of length 3 exists or not in the array.

Formally the function should:

Return true if there exists i, j, k
such that arr[i] < arr[j] < arr[k] given 0 ≤ i < j < k ≤ n-1 else return false.

Your algorithm should run in O(n) time complexity and O(1) space complexity.

Examples:
Given [1, 2, 3, 4, 5],
return true.

Given [5, 4, 3, 2, 1],
return false.

class Solution(object):
    def increasingTriplet(self, nums):
        """
        :type nums: List[int]
        :rtype: bool
        """
        if len(nums) < 3: return False
        
        counter = 1
        stack = []
        for i in nums:
            if len(stack) == 0:
                stack.append(i)
            elif i < stack[0]:
                stack[0] = i
            elif len(stack) == 1 and i > stack[0]:
                stack.append(i)
            elif len(stack) == 2 and stack[0] < i < stack[1]:
                stack[1] = i
            elif len(stack) == 2 and i > stack[1]:
                return True
        return False