Goldentime: 2017

Tuesday, December 19, 2017

[Book Exercise] Regression Methods in Biostatistics

Chapter 2 exercise:

https://github.com/xinyutan/book_regression/blob/master/Ch2_EDM.ipynb

Chapter 3 exercise:

https://github.com/xinyutan/book_regression/blob/master/Ch3_exercise.ipynb

Thoughts: I feel low efficiency when reading by myself. I feel myself learning a lot better when I am in a class. Why is that? 1. I think listening to something is a better way for me to absorb information 2. It's better to have constant feedback and assurance. However, I must learn to learn by reading myself.

Chapter 4:

Notes: https://github.com/xinyutan/book_regression/blob/master/Notes_Chapter4LinearRegression.ipynb

Notes about different relationships between predictors.

Exercise: https://github.com/xinyutan/book_regression/blob/master/Ch4_exercise.ipynb

Wednesday, December 6, 2017

Questions regarding to risk stratification

As I finished reading the book Risk Stratification - A practical guide for clinicians, I get so confused by the "goal" of risk stratification. I thought it would be easy for me and I more or less understand, but the clinical concept is so confusing!

I will just take notes of some questions here. Hopefully, I will understand them later.

In Page 112, it says that only when the epidemiology problem is worked out, we should consider risk stratification. Then, what is the population-based epidemiology? What does it try to solve?

Okay, I found a dataset that I might be able to use for practice: http://archive.ics.uci.edu/ml/datasets/Thoracic+Surgery+Data

Friday, December 1, 2017

First view of risk stratification

Risk stratification is widely used in medicine.

At the first sight, risk stratification is nothing more than to learn the coefficients of linear/logistic regression. But if it's this simple, then why do medical professionals spend so much effort elaborate it? What's the difference? What are the things that we overlook in machine learning but need to pay special attention in medical context?

I am reading Risk Stratification - A practical guide for clinicians to find out.

==

By reading some paragraphs into it, the goals for two studies are very different. For machine learning classification, our goal is to predict as accurate as possible. So we use whatever methods that works the best. Normally, logistic regression is not so great. However, here, our goal is more focused on the "variables"; we want to know exactly how much this particular variable contributes to the outcome - linear model has an advantage in its clarity.

==

Regarding to data collection, be careful of harvesting effects where subtle effects that remove people from risk before they come to attention. We should always ask whether the data collection is complete , does it exclude certain population that we overlook (e.g., die before/during the intervention)?

==

About the difference between risk stratifications and clinical research study: the goal and the data are all different. For risk stratification, we want to know what are the other risk factors that cause the treatment result different. Therefore, the data will only contain the people with treatment (??? can't we make treatment a variable as well???). For clinical research study, however, we specifically want to know the effect of the treatment. Therefore, we want to keep all other factors the same, at least similar.

It seems that risk stratification is more on "understanding", clinical research studies more on "testing a well-defined hypothesis".

Several type of experiment design:

randomized trial (strict control)
cohort study (no manual control, only observe)
case-control (focus on positive cases. Due to the lack of whole population with a certain exposure/treatment, then not suitable for risk calculation.)

==

Rate standardization:
One good example is the mortality rate due to cancer. If we do not account the demographics change, we would conclude that the mortality rate increased 50% from 1940 to 1980. However, we also know that in 1980, there are a lot more senior population. If we take age difference into account, we would find that the mortality rate only increases 10%. This example shows a very thoughtful concern in medical problem.

Steps after calculating the Observed/Evaluated ratio (need to read more on statistics):

(chi-square for discrete data) statistical test
confidence interval
If the statistical test shows non-significant, check statistical power.

==

Some common concerns for machine learning task as well:
1. Need to consider if the training data and testing data follow the same distribution (features and labels).
2. Selection bias - is the treatment population selected non-randomly?

Sunday, September 3, 2017

[Leetcode] 388. Longest Absolute File Path

Suppose we abstract our file system by a string in the following manner:

The string "dir\n\tsubdir1\n\tsubdir2\n\t\tfile.ext" represents:

dir
    subdir1
    subdir2
        file.ext

The directory dir contains an empty sub-directory subdir1 and a sub-directory subdir2 containing a file file.ext.

The string "dir\n\tsubdir1\n\t\tfile1.ext\n\t\tsubsubdir1\n\tsubdir2\n\t\tsubsubdir2\n\t\t\tfile2.ext" represents:

dir
    subdir1
        file1.ext
        subsubdir1
    subdir2
        subsubdir2
            file2.ext

The directory dir contains two sub-directories subdir1 and subdir2. subdir1 contains a file file1.ext and an empty second-level sub-directory subsubdir1. subdir2 contains a second-level sub-directory subsubdir2 containing a file file2.ext.

We are interested in finding the longest (number of characters) absolute path to a file within our file system. For example, in the second example above, the longest absolute path is "dir/subdir2/subsubdir2/file2.ext", and its length is 32 (not including the double quotes).

Given a string representing the file system in the above format, return the length of the longest absolute path to file in the abstracted file system. If there is no file in the system, return 0.

Note:

The name of a file contains at least a . and an extension.
The name of a directory or sub-directory will not contain a ..

Time complexity required: O(n) where n is the size of the input string.

Notice that a/aa/aaa/file1.txt is not the longest file path, if there is another path aaaaaaaaaaaaaaaaaaaaa/sth.png.

I solved this problem about 7 months ago - when I first started to tackle these Leetcode problems. Back then, I wasn't proficient in any algorithms. But somehow, I managed to solve this problem with an extremely messy code. Ever since I got that job offer, I thought I should re-do some of the problems or focus those problems that require some modeling (i.e., Google problems).

This problem is on the easier side of this kind. It is very obvious files and folders are organized in tree structure. Only if we have a well-organized tree structure, then we can basically use backtracking to find out the answers. However, the input is a string. But, by looking at the string, we can see that the number of '\t' in the element is the same as the level of that elements. For example, '\tsubdir1', '\t\tfile.txt' are in the first, second level of the file organization, respectively.

We can't easily do a tree traverse on this type of string input, but hey, the elements order after input.split('\n') is similar to the in-order traversal of the directory tree (here might be a forest).

Therefore, the abstraction of the problem will be:

Given the in-order traversal of a tree and each element's level, reconstruct the maximum length path from root to leaf where we only considers those leaves that contain '.' (file).

Well, then we just use a stack, when the level is higher than the current level, meaning that the incoming element will be the children of current element; when lower, meaning that the incoming one is the parent; the same, siblings.

class Solution(object):
    def lengthLongestPath(self, input):
        """
        :type input: str
        :rtype: int
        """
        stack = []
        current_level = 0
        res = 0
        for name in input.split('\n'):
            #print stack, current_level
            tabs = name.split('\t')
            if len(tabs) - 1 == current_level:
                if stack:
                    stack.pop()
                stack.append(tabs[-1])
            elif len(tabs) -1 > current_level:
                stack.append(tabs[-1])
            else:
                for _ in range(current_level - len(tabs) + 2):
                    if stack: stack.pop()
                stack.append(tabs[-1])
            if '.' in tabs[-1]:
                res = max(res, len('/'.join(stack)))
            current_level = len(tabs) - 1
        return res

[Update] The end of a period

I got a job offer in a small company in town - it matches my expectation in almost every aspect: meaningful domain, the opportunity to learn and improve, and nice pay.

The interview did not use any of the stuff I have prepared for the last few month - days and nights going through, sometimes painfully, all these Leetcode problems. But I am glad I did it. By doing these algorithmic problems, I have gained a much deeper understanding of common algorithms, been more appreciative of good and efficient code, and cultivated a habit to endure some hard and lonely moments in life.

My progress in Leetcode - I actually never have imagined that I could have done so many problems.

I like doing these problems, so I will likely keep updating some interesting problems I have solved, but to a less intensity. This might be even better, as now I am in a much more relaxing mood - I might be able to think deeper than just finishing as much problems as possible.

I look forward to this new chapter of my life

Sunday, August 27, 2017

[Leetcode] 334. Increasing Triplet Subsequence

This problem reminds me of the fancy version of longest increasing subsequence problem ($O(n\log n)$ solution).

Let's look at an example array: 1, 2, -1, 1, -2, 3. When we go through the first two element, we are all good. Then, we encounter -1, here we face a decision, should we update the current minimum to -1?

Let's see why we should and why we shouldn't. -1 is a safer move than 1, what if later our solution is -1, 0, 1? Then we definitely want to update it. However, you might say, if we update it to -1, then 2 shouldn't be in our solution anymore, according to the "subsequence" rule.

Well, the answer is YES, but we only update the minimum, keeping the second minimum (which is 2 here).The argument is as following: when we keep the second minimum (call it min2), this means that we only count increasing when the next value is greater than min2. You see, by changing min1 to a smaller value, it doesn't really affect the decision we are going to make about adding an increasing element. However, if we see the next value is smaller than min2 but greater than min1, we then want to update min2.

In this way, we always keep track the safest possible value yet maintain the order we have for now.

In actual implementation, I used a stack instead of min1, min2. It turned out to be much easier to handle. Well, who knows such a seeming simple question took me good two hours to think through - this kind of question is very tricky.

Given an unsorted array return whether an increasing subsequence of length 3 exists or not in the array.

Formally the function should:

Return true if there exists i, j, k
such that arr[i] < arr[j] < arr[k] given 0 ≤ i < j < k ≤ n-1 else return false.

Your algorithm should run in O(n) time complexity and O(1) space complexity.

Examples:
Given [1, 2, 3, 4, 5],
return true.

Given [5, 4, 3, 2, 1],
return false.

class Solution(object):
    def increasingTriplet(self, nums):
        """
        :type nums: List[int]
        :rtype: bool
        """
        if len(nums) < 3: return False
        
        counter = 1
        stack = []
        for i in nums:
            if len(stack) == 0:
                stack.append(i)
            elif i < stack[0]:
                stack[0] = i
            elif len(stack) == 1 and i > stack[0]:
                stack.append(i)
            elif len(stack) == 2 and stack[0] < i < stack[1]:
                stack[1] = i
            elif len(stack) == 2 and i > stack[1]:
                return True
        return False

Wednesday, August 23, 2017

[Leetcode] 164. Maximum Gap

A really smart solution using pigeon hole principle and bucket sort...

Given an unsorted array, find the maximum difference between the successive elements in its sorted form.

Try to solve it in linear time/space.

Return 0 if the array contains less than 2 elements.

You may assume all elements in the array are non-negative integers and fit in the 32-bit signed integer range.

class Solution(object):
    def maximumGap(self, nums):
        """
        :type nums: List[int]
        :rtype: int
        """
        if not nums: return 0
        min_v, max_v = min(nums), max(nums)
        if min_v == max_v: return 0
        
        buckets = [[] for _ in range(len(nums)+1)]
        for num in nums:
            bkt = (num - min_v) * len(nums) / (max_v - min_v)
            buckets[bkt].append(num)
        
        i = 0
        res = 0
        while i < len(buckets) - 1:
            k = i + 1
            while len(buckets[k]) == 0:
                k = k + 1
            res = max(res, min(buckets[k]) - max(buckets[i]))
            i = k
        return res

Sunday, August 20, 2017

[MOOC] Implement TrieMatching

This is the third problem in assignment 1 for this class. It has been a long time since I worked on the first problem. I was confused why we have to model the trie as the graph where node is the number, edge the alphabet. Why can't we model it where node is the alphabet?

An counter-example will be: Consider a pattern 'ATAC', and we want to see if it matches to 'ATAT'. If we model the pattern as the first graph representation, it will be {'A': ['T', 'C'], 'T': ['A'], 'C': ['$']}. Then we will say 'ATAT' matches 'ATAC', which is obviously wrong.

Here is the code.

def solve (text, n, patterns):

    result = []
    # write your code here
    trie = build_trie(patterns)
    for i in range(len(text)):
        if match_pattern(text[i:], trie):
            result.append(i)
    return result

def build_trie(patterns):
    """Build trie for patterns"""
    trie = {0: {}}
    counter = 1
    for p in patterns:
        p = p + '$'
        current_node = 0
        for c in p:
            if c in trie[current_node]:
                current_node = trie[current_node][c]
            else:
                trie[current_node][c] = counter
                trie[counter] = {}
                current_node = counter
                counter += 1
    return trie

def match_pattern(text, trie):
    cur_node = 0
    for ch in text:
        if ch in trie[cur_node]:
            cur_node = trie[cur_node][ch]
            if '$' in trie[cur_node]:
                return True
        else:
            return False
    if '$' in trie[cur_node]:
        return True
    else:
        return False

Saturday, August 19, 2017

[Leetcode] Sudoku Solver

Okay, this is a pretty classical backtracking problem. There are some details of writing recursion in python that I can't say I fully understand.

Write a program to solve a Sudoku puzzle by filling the empty cells.

Empty cells are indicated by the character '.'.

You may assume that there will be only one unique solution.

A sudoku puzzle...

I referred to this post to finally get my program running. One thing I did not do is to "return True/False", but handled it like permutation that kind of questions. I do think "return True/False" make sense, but I don't know why I did not think it necessary.

class Solution(object):
    def solveSudoku(self, board):
        """
        :type board: List[List[str]]
        :rtype: void Do not return anything, modify board in-place instead.
        """
        
        self.solve(board, 0, 0)
    
    def solve(self, board, i, j):
        
        def candidates(i, j):
            not_available = []
            for x in range(9):
                if board[i][x] != '.':
                    not_available.append(board[i][x])
                if board[x][j] != '.':
                    not_available.append(board[x][j])
            grid_x, grid_y = i / 3, j / 3
            for x in (0, 1, 2):
                for y in (0, 1, 2):
                    if board[grid_x * 3 + x][grid_y * 3 + y] != '.':
                        not_available.append(board[grid_x * 3 + x][grid_y * 3 + y])
            return [i for i in '123456789' if i not in not_available]
        
        if i == 9: 
            return True
        elif j == 9:
            return self.solve(board, i+1, 0)
        else:            
            if board[i][j] == '.':
                cands = candidates(i, j)
                if not cands:
                    return False
                for cand in cands:
                    board[i][j] = cand
                    if self.solve(board, i, j + 1):
                        return True
                    board[i][j] = '.'
                return False
            else:
                return self.solve(board, i, j+1)

Friday, August 18, 2017

[Leetcode] Smallest Range

This is a smart use of heap.

You have k lists of sorted integers in ascending order. Find the smallest range that includes at least one number from each of the klists.

We define the range [a,b] is smaller than range [c,d] if b-a < d-c or a < c if b-a == d-c.

Example 1:

Input:[[4,10,15,24,26], [0,9,12,20], [5,18,22,30]]
Output: [20,24]
Explanation: 
List 1: [4, 10, 15, 24,26], 24 is in range [20,24].
List 2: [0, 9, 12, 20], 20 is in range [20,24].
List 3: [5, 18, 22, 30], 22 is in range [20,24].

Note:

The given list may contain duplicates, so ascending order means >= here.
1 <= k <= 3500
-105 <= value of elements <= 105.

class Solution(object):
    def smallestRange(self, nums):
        """
        :type nums: List[List[int]]
        :rtype: List[int]
        """
        h = []
        max_v = -10**5 - 1
        min_v = 10**5 + 1
        for i, num in enumerate(nums):
            h.append([num[0], i, 0])
            if max_v < num[0]: max_v = num[0]
            if min_v > num[0]: min_v = num[0]
        res = [min_v, max_v]
        
        heapq.heapify(h)
        
        while True:
            v, list_index, element_index = heapq.heappop(h)
            if element_index == len(nums[list_index]) - 1:
                return res
            heapq.heappush(h, [nums[list_index][element_index+1], list_index, element_index+1]) 
            if nums[list_index][element_index+1] > max_v:
                max_v = nums[list_index][element_index+1]
            if max_v - h[0][0] < res[1] - res[0]:
                res = [h[0][0], max_v]

Friday, August 11, 2017

[Leetcode] 131. Palindrome Partitioning

Well, doing leetcode problems can be addictive - I used to go to Facebook everyday - not any more and I don't feel missed at all. Now, somehow I must do some leetcode problems, otherwise I just feel wrong. It sometimes gets hard and frustrating. But I guess as I read it somewhere, in our nature human beings just love something bitter, like beer, coffee, as well as challenges (but have to be meaningful and able to make progress and intellectually fulfilling, at least for me).

Given a string s, partition s such that every substring of the partition is a palindrome.

Return all possible palindrome partitioning of s.

For example, given s = "aab",
Return

[
  ["aa","b"],
  ["a","a","b"]
]

This problem is similar to subset problem, my solution is not the "optimal" one, but very intuitive.

class Solution(object):
    def partition(self, s):
        """
        :type s: str
        :rtype: List[List[str]]
        """
        if len(s) == 0: return [[]]
        if len(s) == 1: return [[s]]
        
        res = []
        for i in range(1, len(s)+1):
            part1 = s[:i]
            part2 = s[i:]
            #print part1, part2
            if part1 == part1[::-1]:
                tmp_res = self.partition(part2)
                #print tmp_res
                for r in tmp_res:
                    res.append([part1] + r)
        return res

Wednesday, August 9, 2017

[Leetcode] 145. Binary Tree Postorder Traversal

This is not a hard problem compared to some really hard ones. The key is to flag every node, if first in stack, 1, if all its children are in the stack, mark it as 2. Later, when encounter a node, if its flag is 1, then we push its children in the stack and mark it as 2; if it's 2, then we pop it and record the result. The order to put the children? First right, then left --> then we are able to see left first and then the right.

Given a binary tree, return the postorder traversal of its nodes' values.

For example:
Given binary tree {1,#,2,3},

return [3,2,1].

Note: Recursive solution is trivial, could you do it iteratively?

# Definition for a binary tree node.
# class TreeNode(object):
#     def __init__(self, x):
#         self.val = x
#         self.left = None
#         self.right = None

class Solution(object):
    def postorderTraversal(self, root):
        """
        :type root: TreeNode
        :rtype: List[int]
        """
        if not root:
            return []
        visited = {}
        stack = [root]
        visited[root] = 1
        res = []
        
        is_leaf = lambda node: not node.left and not node.right
        
        while stack:
            top = stack[-1]
            if visited[top] == 2 or is_leaf(top):
                top = stack.pop()
                res.append(top.val)
            else: 
                if top.right:
                    stack.append(top.right)
                    visited[top.right] = 1
                if top.left:
                    stack.append(top.left)
                    visited[top.left] = 1
                visited[top] = 2
        return res

[Leetcode] 199. Binary Tree Right Side View

I referred to this solution: it's really succinct and ingenious. I am a little far from this level of coding.

Given a binary tree, imagine yourself standing on the right side of it, return the values of the nodes you can see ordered from top to bottom.

For example:
Given the following binary tree,

1            <---
 /   \
2     3         <---
 \     \
  5     4       <---

You should return [1, 3, 4].

# Definition for a binary tree node.
# class TreeNode(object):
#     def __init__(self, x):
#         self.val = x
#         self.left = None
#         self.right = None

class Solution(object):
    def rightSideView(self, root):
        """
        :type root: TreeNode
        :rtype: List[int]
        """
        def collect(node, depth):
            if not node:
                return
            if len(view) == depth:
                view.append(node.val)
            collect(node.right, depth+1)
            collect(node.left, depth+1)
        
        view = []
        collect(root, 0)
        return view

[Leetcode] 498. Diagonal Traverse

Well, this problem itself has nothing to note. But I must take a screen shot for record - the first time my program beats 100%... just for fun. Problem wise, it's just a bunch of index tricks - along the diagonal, element matrix[i][j], i + j maintains the same.

Given a matrix of M x N elements (M rows, N columns), return all elements of the matrix in diagonal order as shown in the below image.

Example:

Input:
[
 [ 1, 2, 3 ],
 [ 4, 5, 6 ],
 [ 7, 8, 9 ]
]
Output:  [1,2,4,7,5,3,6,8,9]
Explanation:

Note:

The total number of elements of the given matrix will not exceed 10,000.

One thing to note is that the input size says that we can only afford $O(n)$ algorithm.

class Solution(object):
    def findDiagonalOrder(self, matrix):
        """
        :type matrix: List[List[int]]
        :rtype: List[int]
        """
        if not any(matrix): return []
        m, n = len(matrix), len(matrix[0])
        
        if m == 1: return matrix[0]
        if n == 1: return [matrix[i][0] for i in range(m)]
        
        res = []
        for k in range(m+n):
            if k >= n:
                start = k -n + 1 # row 
            else:
                start = 0
            if k >= m:
                end = m - 1 
            else:
                end = k
            tmp = []
            for i in range(start, end + 1):
                tmp.append(matrix[i][k-i])
            
            if k % 2 == 0:
                res += tmp[::-1]
            else:
                res += tmp
        return res

Tuesday, August 8, 2017

[Leetcode] 43. Multiply Strings

Looks like a simple problem. I thought to use Karatsuba algorithm to speed the program up, but it turns out not necessary - as the longest input will just be less than 180, then $O(180^2)$ is definitely good enough.

I briefly looked at some other people's hints and see that the key to this problem is that: num1[i] * num2[j] will be at final result's [i+j] position and [i+j+1] for the carrier.

Given two non-negative integers num1 and num2 represented as strings, return the product of num1 and num2.

Note:

The length of both num1 and num2 is < 110.
Both num1 and num2 contains only digits 0-9.
Both num1 and num2 does not contain any leading zero.
You must not use any built-in BigInteger library or convert the inputs to integer directly.

class Solution(object):
    def multiply(self, num1, num2):
        """
        :type num1: str
        :type num2: str
        :rtype: str
        """
        n1, n2 = len(num1), len(num2)
        res = [0 for _ in range(n1+n2+1)]
        num1 = num1[::-1]
        num2 = num2[::-1]
        for i in range(n1):
            b1 = int(num1[i])
            for j in range(n2):
                b2 = int(num2[j])
                cur_b = res[i+j] + b1 * b2
                if cur_b <= 9:
                    res[i+j] = cur_b
                else:
                    res[i+j] = cur_b % 10
                    res[i+j+1] += cur_b / 10
        res = map(str, res)
        res = ''.join(res[::-1])
        i = 0
        while i < len(res) - 1 and res[i] == '0':
            i += 1
        return res[i:]

[MOOC] Deep Learning Specialization

I heard of this MOOC by Andrew Ng today and am really excited about it. As a huge fan of good MOOCs, I can't wait to try it out.

I long to systematically learn deep learning. I have applied some deep learning methods to some projects and achieved quite good results, and I am amazed by how well they work on certain problems and how easily they can transfer to a completely different task without much extra work.

With Keras package, building a decent neural network is probably just several lines of code, but it will be nice to gain some "correct" insights, especially from the best people in the field.

Speaking of this, I just love how open this field has been. Maybe I am not in the field, but I just feel amazed by how many good tutorials/MOOCs/software packages are out there. I think it definitely brings a better opportunity for whoever wants to study or work on it, but more importantly, as such an important technology that changes or will change people's life so much, these open resources allow people to see clearer what are causing the changes and be aware of them. This awareness, at least to myself, is important.

Monday, August 7, 2017

[Leetcode] Implement Trie (Prefix Tree)

Well, I just finished the week 1 of Algorithms on Strings. I think without doing homework problem there, I would not be able to think of how to represent the tree. BUT, of course, graph, adjacency list.

Implement a trie with insert, search, and startsWith methods.

class Trie(object):

    def __init__(self):
        """
        Initialize your data structure here.
        """
        self.tree = {}
        self.tree[0] = {}
        self.new_node_idx = 1

    def insert(self, word):
        """
        Inserts a word into the trie.
        :type word: str
        :rtype: void
        """
        current_node = 0
        word = word + '$'
        for i in range(len(word)):
            current_symbol = word[i]
            if current_symbol in self.tree[current_node]:
                current_node = self.tree[current_node][current_symbol]
            else:
                self.tree[current_node][current_symbol] = self.new_node_idx
                self.tree[self.new_node_idx] = {}
                current_node = self.new_node_idx
                self.new_node_idx += 1

    def search(self, word):
        """
        Returns if the word is in the trie.
        :type word: str
        :rtype: bool
        """
        word = word + '$'
        current_node = 0
        for i in range(len(word)):
            current_symbol = word[i]
            if current_symbol in self.tree[current_node]:
                current_node = self.tree[current_node][current_symbol]
            else:
                return False
        return True
    
    def startsWith(self, prefix):
        """
        Returns if there is any word in the trie that starts with the given prefix.
        :type prefix: str
        :rtype: bool
        """
        current_node = 0
        for i in range(len(prefix)):
            cur_symbol = prefix[i]
            if cur_symbol in self.tree[current_node]:
                current_node = self.tree[current_node][cur_symbol]
            else:
                return False
        return True

# Your Trie object will be instantiated and called as such:
# obj = Trie()
# obj.insert(word)
# param_2 = obj.search(word)
# param_3 = obj.startsWith(prefix)