Notes and reflections from "How AI decodes Human Health"
January 11, 2025
10 min read time
Below are my notes based on 3 presentations from "How AI decodes Human Health" at the University of Toronto.
1. Research vs Deployment
The research demonstrated limitations in the clinical application of a cardiac arrest classifier based on 8 biosignals. RLHF was infeasible in the ER environment due to the time constraints on clinical staff. Additionally, the model's accuracy degraded for two main reasons:
Feedback amplification: biases were amplified because clinicians set aside their own judgement when the model's results diverged from it.
Poor data quality: The model was trained on a dataset from 1999-present, introducing potential demographic skew.
2. General applications in Healthcare
EDRN optimization: Optimizing schedules of emergency department registered nurses.
Cancer patient prioritization: Identify and prioritize the most urgent cancer cases.
AI Medical scribes: Clinicians often can't type fast enough to capture all patient details, and missing key facts can lead to misdiagnosis.
Guidance for MIS (minimally invasive surgical) procedures: In laparoscopic cholecystectomy procedures, real-time AI visualization is used to delineate surgical margins, with green regions indicating viable excision zones and red indicating critical structures. Clinical outcomes demonstrate enhanced procedural success rates.
The speaker also mentioned physicians and AI make distinctly different types of errors. His theory suggests that when diagnoses differ, a hybrid approach yields optimal outcomes. If either the clinician or AI concludes True while the other concludes False, the final diagnosis should be True. The results have led to an overall 26% reduction in ER mortality rates.
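A minimal sketch of that combination rule (my own illustration, not the speaker's actual system):

```python
def combined_diagnosis(clinician_positive: bool, ai_positive: bool) -> bool:
    """Hybrid rule from the talk: if either the clinician or the AI calls the
    case positive, the final diagnosis is positive (a logical OR)."""
    return clinician_positive or ai_positive

# Clinician says negative, AI says positive -> combined diagnosis is positive.
assert combined_diagnosis(False, True) is True
```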
3. Implementation of AI in Pathology
The speaker focused on the pathology of prostate cancer. He defined false negative as wrongly diagnosing malignant as benign and false positive as misdiagnosing benign as malignant. He believes false negatives are far more dangerous than false positives.
He described complacency bias, where inexperienced clinicians may discard their own judgment in favor of the AI even when the AI is wrong, swayed by its perceived superiority. He believes we're in the Gartner hype cycle's "Trough of Disillusionment" and, like the previous speaker, preferred the term "augmented intelligence" over "artificial intelligence."
Follow-up
During one of the talks, an audience member asked "Will doctors eventually be replaceable?" to which the speaker responded: "AI won't replace doctors, but doctors who use AI will replace those who don't."
So how exactly do AI's diagnostic errors differ from those of human clinicians?
I found answers in a paper recommended by one of the speakers (Raciti P, Sue J, Retamero JA, et al., 2023). The study shows that pathologists' performance can be affected by fatigue, confirmation bias, and visual acuity.
Types of Errors Observed Without AI (Pathologist Alone):
Missing small tumors: failing to detect small cancerous foci and well-differentiated tumors.
Misinterpreting benign mimics: misclassifying benign tissue (e.g., seminal vesicles, HGPIN, benign prostatic gland around a nerve) as cancerous.
Types of Errors Observed with AI (Pathologist + PaPr):
PaPr driven inaccuracies: While PaPr helped correct all previously incorrect diagnoses, 85% of initially correct diagnoses became incorrect.
Incorrect area of interest: In one case, while PaPr correctly classified a slide as cancerous, it misidentified the area of interest.
False positives: although specificity improved, a few WSIs (whole-slide images) that were truly benign were still incorrectly flagged by both the AI and the pathologist as cancerous.
The key takeaway is that PaPr AI enhances pathologists' detection capabilities by highlighting easily missed tissue features. Sensitivity improved with a 70% reduction in detection errors, helping pathologists identify tumors they might have otherwise overlooked.
Reflections
It's evident that there is a lack of shareable data across clinics. As we know, models require large amounts of high quality data to perform well. While this limitation stems from regulatory concerns, even with available anonymized datasets, data processing remains challenging.
Can we train agents to use existing medical knowledge to simulate training data for specific diagnoses?
Can at home IoT devices securely collect large amounts of personalized data?
The integration of domain knowledge with statistical modeling through LLMs presents an interesting challenge. Experienced pathologists and PaPr interpret slides differently because of their distinct learning approaches: PaPr learns from examples, while pathologists rely on systematic knowledge to deduce an outcome. Therefore:
Can AI be trained to develop diagnostic CoT reasoning using medical school curriculum?
If clinicians are willing to share their observations and reasoning, can we add multimodal human feedback to the training data?
Would probabilistic diagnostic outcomes enhance clinical decision-making accuracy compared to purely binary classification methods?
AI assistants could help patients communicate their symptoms more precisely using medical terminology.
Optimization of resources in the ER.
In conclusion, I'm left with a lingering thought from House:
"It is in the nature of medicine that you're going to screw up!" — Dr. House
References
Raciti P, Sue J, Retamero JA, et al. Clinical Validation of Artificial Intelligence-Augmented Pathology Diagnosis Demonstrates Significant Gains in Diagnostic Accuracy in Prostate Cancer Detection. Arch Pathol Lab Med. 2023;147(10):1178-1185. doi:10.5858/arpa.2022-0066-OA
The prospect of loss
December 19, 2024
5 min read time
I encountered a phrase along the lines of "What's possible isn't always right" while listening to David and Goliath, and it led to some pondering. It sounds like plain intuition, yet too often we spend countless hours down a rabbit hole pursuing a possible solution, when the right solution may take a simpler but different path.
Take, for example, scraping Wikipedia pages with BeautifulSoup. While handling the numerous tag edge cases is possible, and fixing each one is oddly satisfying, is it really the right approach? Had I known about SPARQL earlier, I wouldn't have wasted time trying to scrape Wikipedia.
So why do we persist with familiar but suboptimal solutions? Most literature points to cognitive biases like loss aversion, where we, as irrational beings, feel losses more strongly than gains of the same magnitude. I was comfortable with Python and saw a clear path forward, while SPARQL presented uncertainties, such as whether it could handle my use case. So it really felt like I was embracing my inner human instincts and avoiding uncertainty (blah blah blah). But if we reconsider the situation from the loss perspective, it turns out to be quite the contrary. Here are the two options:
Option 1: a 100% chance of spending t hours learning SPARQL
Option 2: a p% chance of spending d extra hours parsing HTML tags traditionally instead of using SPARQL
where d > t and p < 100%
It turns out I chose option 2 because I was strangely risk seeking when facing losses: I preferred the possibility of spending d extra hours precisely because p < 100%, which left open the chance of losing nothing at all. As the quick sketch below shows, appearing risk averse can actually lead to taking larger risks and lower expected utility.
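A quick back-of-the-envelope comparison with made-up numbers (t, d, and p below are purely illustrative, not measured):

```python
# Purely illustrative numbers.
t = 4      # hours: certain cost of option 1 (learn SPARQL up front)
d = 10     # hours: extra cost of option 2 if the tag edge cases pile up
p = 0.6    # probability that option 2 actually incurs the extra cost

expected_cost_option_1 = t        # 4.0 hours, with certainty
expected_cost_option_2 = p * d    # 6.0 hours on average

# Option 2 is worse in expectation (6 > 4), yet it still feels attractive
# because there is a 1 - p = 40% chance of losing nothing at all.
print(expected_cost_option_1, expected_cost_option_2)
```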
Two issues from building interactive visualization via AI code generation
December 16, 2024
5 min read time
Lately, one of the most repetitive tasks encountered at work is creating and maintaining project Gantt charts in spreadsheets. If only we could just upload a CSV containing JIRA tickets and have it automatically create a Gantt chart for us...
A secondary objective of this expedition was to evaluate AI's code generation capabilities. I used Perplexity Pro for this exercise. It was great at generating boilerplate code, but it often suggested simple, brute-force workarounds. For example, to fix axis misalignment it would shift the x axis by a hardcoded number of pixels. For subtler bugs, the proposed solutions often created a vicious cycle. The experience reminded me a bit of the Mr. Bean painting restoration scene (start at 3:45):
Digressions aside, here are 2 memorable issues I encountered:
Issue 1: Axis labels don't automatically adjust when zooming in or out, which causes them to either overlap or leave empty spaces on the canvas
AI's solution: add a style transformation that scales the width of the component, like so:
style: {{width: {100*zoom}%}}
But the problem is that the axis labels shrink as the user zooms out, leaving empty space around the chart area. I first considered something like overflow-x: auto to automatically show extra labels, but I realized it's impractical to have an infinitely sized axis that keeps populating with data no matter how far you zoom out. A more efficient solution is to dynamically compute the axis labels based on the zoom level.
If we establish a base date range for a base zoom level, we can then adjust the visible date range based on the zoom-adjusted distance to the center date:
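The chart itself lives in frontend code, but the idea can be sketched in Python; this is my own reconstruction, and the 30-day base window and helper name are assumptions:

```python
from datetime import date, timedelta

def visible_date_range(center: date, base_days: int, zoom: float) -> tuple[date, date]:
    """Scale the visible window around the center date: zooming in (zoom > 1)
    shrinks the window, zooming out (zoom < 1) widens it."""
    half_window = timedelta(days=(base_days / zoom) / 2)
    return center - half_window, center + half_window

# At zoom = 1 the chart shows the 30-day base window; at zoom = 2 it shows 15 days.
start, end = visible_date_range(date(2024, 12, 15), base_days=30, zoom=2.0)
```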
But how can we compute axis label count?
I sought inspiration from AI, and I must admit it is an amazing research companion. A commonly proposed idea was to establish a tick count and a step size and use them to build an array of labels. It couldn't tell me exactly what step size to use, but a bit of intuition filled the gap:
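Something along these lines, again sketched in Python with hypothetical names; the intuition is simply that the step size should scale with the zoom level:

```python
from datetime import date, timedelta

def axis_labels(start: date, end: date, zoom: float) -> list[date]:
    # Intuition: roughly weekly ticks at zoom = 1, denser ticks as we zoom in.
    step = timedelta(days=max(1, round(7 / zoom)))
    # Naive count: this treats `end` as the last tick-occupying element.
    count = (end - start) // step + 1
    return [start + i * step for i in range(count)]

labels = axis_labels(date(2024, 12, 1), date(2024, 12, 31), zoom=1.0)
# [date(2024, 12, 1), date(2024, 12, 8), ..., date(2024, 12, 29)]
```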
Issue 2: Taskbar start and end dates don't align with the axis labels, and this misalignment becomes more pronounced with zooming.
This was a bit trickier because I could not figure out how to get help from AI. Here are its suggestions:
Back to old-fashioned debugging. I noticed that the misalignment grew larger as I zoomed in or out, which suggested a small initial error being magnified by the zoom factor. Inspecting elements at zoom = 1 (the base case) revealed an extra gap after the end date, even though the start and end dates themselves were correct. This is why the taskbar computations were off: the previous axis label count calculation treated end as the last tick-occupying element, when end should actually be invisible. The corrected formula should be:
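In terms of the earlier sketch, the fix is to drop the trailing +1 so that end is treated as the chart edge rather than a visible tick (again my reconstruction, not the original code):

```python
from datetime import date, timedelta

def axis_label_count(start: date, end: date, step: timedelta) -> int:
    # `end` marks the edge of the chart area, not a tick that occupies space,
    # so it is excluded from the count (no trailing +1).
    return (end - start) // step
```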
One final observation is AI's bias toward generating lengthy code, which can be hard to read at times. Nevertheless, it excels at handling tedious tasks while providing detailed documentation. Perhaps this will allow us to focus on more specific problems.
Default dict in python
January 7, 2020
10 min read time
Last week, while writing python code, I became deeply perplexed when my code repeatedly did not generate the desired results. After tracing it multiple times, I discovered an interesting behaviour of defaultdict.
defaultdict is a Python collections type that offers default values for dictionaries. The default is supplied through default_factory, a callable that generates the default value. From the docs:
dict subclass that calls a factory function to supply missing values
class collections.defaultdict(default_factory=None, /[, ...])
Without it, we can accomplish the same thing using setdefault(), but the docs describe defaultdict as the simpler and faster approach.
Here's how I set up my defaultdict:
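The original snippet isn't reproduced in these notes; below is a minimal reconstruction, where the list factory is my assumption (any other default_factory shows the same behaviour):

```python
from collections import defaultdict

# Reconstructed setup: with a list factory, missing keys default to [].
d = defaultdict(list)
```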
Now if I access a random key k that does not exist in d via get, it returns None, just like a regular dict would. But if I instead access it through the [] operator, this is where the magic of defaultdict begins: a subsequent get call now returns the default value rather than None, because the key now exists in the dictionary. Indeed, inspecting d shows that key 3 and its default value were inserted into the dict:
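Recreating the whole sequence (with the same assumed list factory, repeated so the snippet runs on its own):

```python
from collections import defaultdict

d = defaultdict(list)

print(d.get(3))   # None  -> get() behaves like a plain dict for a missing key
print(d[3])       # []    -> the [] operator returns the default value...
print(d.get(3))   # []    -> ...and now get() finds the key as well
print(d)          # defaultdict(<class 'list'>, {3: []})  -> d was mutated by d[3]
```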
So what happened? Looking deeper into the __getitem__ call for regular dictionaries:
d[key]
Return the item of d with key key. Raises a KeyError if key is not in the map.
If a subclass of dict defines a method __missing__() and key is not present, the d[key] operation calls that method with the key key as argument. The d[key] operation then returns or raises whatever is returned or raised by the __missing__(key) call. No other operations or methods invoke __missing__(). If __missing__() is not defined, KeyError is raised. __missing__() must be a method; it cannot be an instance variable.
And looking back at defaultdict objects, the docs describe what __missing__ does when a default_factory is provided through defaultdict's constructor:
If default_factory is not None, it is called without arguments to provide a default value for the given key, this value is inserted in the dictionary for the key, and returned.
This method is called by the __getitem__() method of the dict class when the requested key is not found; whatever it returns or raises is then returned or raised by __getitem__().
But it also mentions:
Note that __missing__() is not called for any operations besides __getitem__(). This means that get() will, like normal dictionaries, return None as a default rather than using default_factory.
And that explains why. Basically, defaultdict.get() will not invoke __missing__ (which is what invokes default_factory), whereas defaultdict.__getitem__() (also known as d[k]) will invoke __missing__ to fill in the key. This explains the initial discrepancy between d.get(k) and d[k]. However, it does not fully explain why, after invoking d[k], d is mutated with k and the result of __missing__(k). So I decided to venture into Python's source code and found the answer in __missing__:
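I won't reproduce the C source here (defaultdict lives in CPython's _collections module), but a Python-level sketch of what it effectively does looks like this; SketchDefaultDict is my own illustrative name:

```python
class SketchDefaultDict(dict):
    """A rough Python-level sketch of collections.defaultdict's behaviour."""

    def __init__(self, default_factory=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.default_factory = default_factory

    def __missing__(self, key):
        if self.default_factory is None:
            raise KeyError(key)
        value = self.default_factory()   # build the default value
        self[key] = value                # __setitem__ is the call that mutates the dict
        return value

d = SketchDefaultDict(list)
d[3]
print(d)         # {3: []} -> mutated, just like the real defaultdict
print(d.get(4))  # None    -> get() never triggers __missing__
```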
So what happens is that this method calls __setitem__ with the default value (supplied by default_factory). As we know from the docs, __setitem__ is called when an assignment occurs, which actually mutates the dictionary. That is why when we access d[k], the underlying d is mutated with the default value for k.
Class attributes in Python
January 26, 2019
10 min read time
These days I have encountered some mysterious bugs while interacting with class attributes, and I thought it's worthwhile to look further into them. Let's start with some background on classes:
A class contains data field descriptions (or properties, fields, data members, or attributes). These are usually field types and names that will be associated with state variables at program run time; these state variables either belong to the class or specific instances of the class.
– Class (computer programming), Wikipedia
So a class attribute is essentially an attribute belonging to a class. This means any changes made to the class attribute should be reflected in instances of the class. For example, operating_system would be the class attribute in the following class:
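The original code isn't reproduced in these notes; based on the names used later (operating_system, m2, minion), a minimal reconstruction looks like this, with the OS values made up:

```python
class Minion:
    # operating_system is a class attribute: it lives on the class object
    # and is shared by every instance.
    operating_system = "linux"

m1 = Minion()
m2 = Minion()

# A change made on the class is reflected in all instances.
Minion.operating_system = "freebsd"
print(m1.operating_system, m2.operating_system)   # freebsd freebsd
```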
Here a change to the class attribute is propagated to its instances. But in Python, we can also reassign the class attribute from a specific instance:
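Continuing the same reconstruction (the class is repeated so the snippet runs on its own):

```python
class Minion:
    operating_system = "linux"

m1, m2 = Minion(), Minion()

# Assigning through m1 creates an instance attribute that shadows the class attribute.
m1.operating_system = "macos"
print(m1.operating_system)   # macos -> resolved from m1's instance namespace
print(m2.operating_system)   # linux -> still resolved from the class namespace
```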
Notice how m2's value has not changed. This is because when a class attribute is assigned through an instance, Python adds the attribute to that instance's namespace, which "overrides" the value from the class namespace. m2, however, still resolves the attribute from the class namespace, because no such value exists in its own instance namespace.
This gets more interesting with mutable objects. Suppose we wish to keep track of the priority of images on the minion:
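A reconstruction of this scenario, assuming the priorities are kept in a dict declared (mistakenly) as a class attribute:

```python
class Minion:
    image_priorities = {}   # mutable class attribute, shared by all instances

m1, m2 = Minion(), Minion()

# Mutating the dict through m1 does not create an instance attribute;
# it modifies the single shared object that the class attribute points to.
m1.image_priorities["ubuntu-22.04"] = 1
print(m2.image_priorities)   # {'ubuntu-22.04': 1} -> visible from m2 as well!
```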
What's happened is that a class attribute change (made from an instance) is propagated to ALL instances of the class!
This is because no assignment actually takes place here: the instance mutates the very object the class attribute refers to, and every instance resolves the attribute to that same object in memory. Better to use an instance attribute here instead.
Does that mean mutable class attributes should never be used? IMHO there could be situations where one wishes to maintain information collected from class instances. For example: tracking all os images ever created on minion instances to monitor duplicates or just for bookkeeping purposes.