|
|
# Visualizing and Measuring the Geometry of BERT
|
|
|
|
|
|
## Preceding work:
|
|
|
This paper builds on [A Structural Probe for Finding Syntax in Word Representations](https://nlp.stanford.edu/pubs/hewitt2019structural.pdf) by John Hewitt and Christopher D. Manning. Its main findings were:
|
|
|
- There are geometric representations of entire dependency parse trees in BERT's activation space
|
|
|
- In Layer 16 (BERT-large)
|
|
|
- After applying a single global linear transformation (which they call the "structural probe")
|
|
|
- The squared distance between context embeddings is roughly proportional to the tree distance in the dependency parse (see the sketch after this list)
|
|
|
- Hewitt and Manning offered no explanation for why this squared-distance pattern arises
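
As a minimal sketch of the relation being probed for, the snippet below applies a (here untrained) linear map to stand-in activation vectors and computes the squared distance that, for a trained probe, approximates parse-tree distance. All dimensions and data are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Minimal sketch of the structural-probe relation (Hewitt & Manning):
# after a single global linear map B, the squared L2 distance between
# context embeddings should approximate parse-tree distance.
# Random vectors stand in for real BERT-large layer-16 activations,
# and B is untrained here; all dimensions are illustrative.
rng = np.random.default_rng(0)
hidden_dim, probe_rank, n_tokens = 1024, 64, 6

H = rng.normal(size=(n_tokens, hidden_dim))    # stand-in activations
B = rng.normal(size=(probe_rank, hidden_dim))  # the structural probe

def probed_squared_distance(i: int, j: int) -> float:
    """Squared distance after the probe; for a trained B this is
    roughly proportional to tree distance in the dependency parse."""
    diff = B @ (H[i] - H[j])
    return float(diff @ diff)

print(probed_squared_distance(0, 3))
```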
|
|
|
|
|
|
|
|
|
## Key findings:
|
|
|
- Evidence that BERT stores linguistic features in separate syntactic and semantic subspaces
|
|
|
- Two linear transformations were found, representing a syntactic and a semantic subspace respectively
|
|
|
- These appear to be orthogonal to each other
|
|
|
- BERT has a fine-grained geometric representation of word senses
|
|
|
- Different word senses form well-separated clusters
|
|
|
- In these clusters, context embeddings seem to encode additional fine-grained meaning
|
|
|
- Attention matrices seem to contain a decent amount of syntactic information
|
|
|
- Mathematical argument for geometry of representations found by Hewitt and Manning
|
|
|
- A Pythagorean embedding (power-2 embedding) may well explain the observed distances
|
|
|
|
|
|
## Probing the attention matrices:
|
|
|
- Task: Classifying the relation between two tokens
|
|
|
- Input: Model-wide attention vector
|
|
|
- Concatenation of the attention weights from all 12 heads in each of the 12 layers
|
|
|
![](./uploads/geometry_fig_1.png)
|
|
|
- Data: Penn Treebank
|
|
|
- 30 relations with 5,000 examples each
|
|
|
- Classifiers: two L2-regularized linear classifiers (see the sketch after this list)
|
|
|
- One binary classifier for whether a relation exists at all (85.5% accuracy)
|
|
|
- One multiclass classifier predicting the concrete relation (71.9% accuracy)
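
A minimal sketch of this probing setup, with random numbers standing in for real attention weights and Penn Treebank labels; the feature layout follows the description above, but every dimension and name here is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the attention probe. For a token pair (i, j), the input is
# the model-wide attention vector: the attention weight between i and j
# from every head in every layer (12 x 12 = 144 features). Random
# values stand in for real attention matrices and relation labels.
rng = np.random.default_rng(0)
n_pairs, n_layers, n_heads = 2000, 12, 12

X = rng.random(size=(n_pairs, n_layers * n_heads))  # stand-in attention vectors
y = rng.integers(0, 2, size=n_pairs)                # stand-in "relation exists" labels

# An L2-regularized linear classifier, as in the binary probe above.
clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X, y)
print(f"train accuracy: {clf.score(X, y):.3f}")
```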
|
|
|
|
|
|
|
|
|
|
|
|
## Simple mathematical explanation for the tree embeddings that Hewitt and Manning found:
|
|
|
Isometric tree embeddings into Euclidean space are **not** possible (equality in the triangle inequality forces all neighbors of a node onto a single line through its image, which fails as soon as a node has three neighbors) <br>
|
|
|
→ Look for other possible representations
|
|
|
|
|
|
**Power-p embeddings**:
|
|
|
- $`\lVert f(x) - f(y) \rVert^p = d(x,y)`$
|
|
|
- Pythagorean embedding (p = 2):
|
|
|
- Especially easy to construct: each edge simply steps in the direction of a fresh unit basis vector (see the sketch after this list)
|
|
|
![](./uploads/geometry_fig_5.png)
|
|
|
- Any tree with n nodes has a Pythagorean embedding into $`\mathbb{R}^{n-1}`$
|
|
|
- A Pythagorean embedding would explain the square-root distance pattern found by Hewitt and Manning
|
|
|
- Power-p embeddings with p < 2 do not always exist
|
|
|
- Power-p embeddings with p > 2 do exist, but lack an equally simple geometric construction
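
The sketch below builds the Pythagorean embedding for a small example tree and checks that the squared Euclidean distance equals tree distance for every node pair; the tree and helper functions are my own illustration of the construction.

```python
import numpy as np

# Pythagorean (power-2) embedding of a small tree into R^(n-1):
# the root sits at the origin, and each edge steps in the direction
# of its own unit basis vector, so a node's embedding is the sum of
# the basis vectors along its path to the root.
parent = {1: 0, 2: 0, 3: 1, 4: 1}   # child -> parent; node 0 is the root
n = len(parent) + 1                 # number of nodes

def embed(node: int) -> np.ndarray:
    vec = np.zeros(n - 1)
    while node != 0:                # walk up to the root
        vec[node - 1] = 1.0         # one basis vector per edge
        node = parent[node]
    return vec

def path_to_root(node: int) -> list:
    path = [node]
    while node != 0:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a: int, b: int) -> int:
    pa, pb = path_to_root(a), path_to_root(b)
    lca = next(x for x in pa if x in set(pb))  # lowest common ancestor
    return pa.index(lca) + pb.index(lca)

# Squared Euclidean distance matches tree distance for every pair.
for a in range(n):
    for b in range(n):
        assert np.isclose(np.sum((embed(a) - embed(b)) ** 2),
                          tree_distance(a, b))
print("squared distance == tree distance for all pairs")
```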
|
|
|
|
|
|
## Visualization of tree embeddings:
|
|
|
Comparison of BERT parse tree embeddings with exact power-2 embeddings
|
|
|
|
|
|
Input for visualization:
|
|
|
- Sentences from Penn Treebank with associated parse tree
|
|
|
- BERT-large token embeddings from layer 16 (as in Hewitt and Manning)
|
|
|
- transformed by the structural probe
|
|
|
- Dimensionality reduction via PCA (sketched below)
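
A compact sketch of the final step of this pipeline, with random vectors standing in for the probe-transformed token embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of the final visualization step: probe-transformed token
# embeddings (random stand-ins here) are reduced to 2D with PCA so
# the parse tree can be drawn in the plane.
rng = np.random.default_rng(0)
probed = rng.normal(size=(20, 64))  # stand-in probed embeddings, one row per token
xy = PCA(n_components=2).fit_transform(probed)
print(xy.shape)                     # (20, 2): one 2D point per token
```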
|
|
|
|
|
|
![](./uploads/geometry_fig_2.png)
|
|
|
*Left images: Original parse tree. Right images: Parse tree from BERT embeddings.*
|
|
|
- dependencies have been drawn as edges to visualize the tree structure
|
|
|
- edge color indicates the deviation from the true tree distance
|
|
|
|
|
|
**Question**: Is the difference between these projected trees and the canonical ones merely noise or a more interesting pattern? <br>
|
|
|
→ The differences are systematic: the average embedding distance varies by dependency type (see the figure below), suggesting that BERT's syntactic representation has an additional quantitative aspect beyond traditional dependency grammar
|
|
|
|
|
|
![](./uploads/geometry_fig_3.png)
|
|
|
*The average squared edge length between two words with a given dependency.*
|
|
|
|
|
|
|
|
|
## Geometry of word senses:
|
|
|
|
|
|
**Question**: BERT seems to have several ways of representing syntactic information, but what about semantic features?
|
|
|
|
|
|
### Visualization of word senses:
|
|
|
Visualization tool:
|
|
|
- to be released to the public
|
|
|
- Setup:
|
|
|
- All sentences from introductions in English Wikipedia entries
|
|
|
- User enters a word; the system gathers 1,000 sentences containing it
|
|
|
- Context embeddings are taken from BERT-base at a layer of the user's choosing
|
|
|
- The 1,000 embeddings are visualized with UMAP (see the sketch after this list)
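
A minimal sketch of this setup, assuming the `umap-learn` package and using random vectors in place of real BERT-base context embeddings:

```python
import numpy as np
import umap  # pip install umap-learn

# Sketch of the word-sense visualization: gather ~1,000 context
# embeddings of one word (random stand-ins here; the tool pulls them
# from Wikipedia sentences and a user-chosen BERT-base layer), then
# project them to 2D with UMAP for plotting.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # stand-in BERT-base embeddings

points_2d = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
print(points_2d.shape)  # (1000, 2): one point per occurrence of the word
```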
|
|
|
|
|
|
![](./uploads/geometry_fig_4.png)
|
|
|
*Embeddings for the word "die" in different contexts, visualized with UMAP. Sample points are annotated with corresponding sentences. Overall annotations (blue text) are added as a guide.*
|
|
|
|
|
|
Findings:
|
|
|
- Well-separated clusters
|
|
|
- fine-grained meaning within clusters (e.g., separating sentences where a single person dies from those where multiple people die)
|
|
|
|
|
|
**Question:** Can we find quantitative corroboration that word senses are well-represented?
|
|
|
|
|
|
### Quantitative analysis of word sense disambiguation:
|
|
|
- Nearest-neighbor classification of word senses (sketched after this list)
|
|
|
- SemCor dataset
|
|
|
- A simple nearest-neighbor classifier achieves 71.1 F1 (above the previous state of the art)
|
|
|
- Accuracy increases monotonically through the layers
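
A minimal sketch of a centroid-based nearest-neighbor sense classifier in the spirit of the one above; the sense names and data are stand-ins, not SemCor.

```python
import numpy as np

# Sketch of nearest-neighbor word-sense classification: average the
# training embeddings of each sense into a centroid, then assign a new
# embedding the sense of the nearest centroid. Random vectors stand in
# for BERT embeddings; sense names are illustrative.
rng = np.random.default_rng(0)
train = {
    "sense_a": rng.normal(size=(50, 768)),
    "sense_b": rng.normal(size=(50, 768)) + 1.0,  # shifted so senses separate
}
centroids = {sense: embs.mean(axis=0) for sense, embs in train.items()}

def classify(embedding: np.ndarray) -> str:
    return min(centroids, key=lambda s: np.linalg.norm(embedding - centroids[s]))

print(classify(rng.normal(size=768) + 1.0))  # most likely "sense_b"
```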
|
|
|
|
|
|
New probe (linear transformation):
|
|
|
- trained to minimize the similarity between embeddings of different senses (a sketch of a plausible loss follows this list)
|
|
|
- Slightly improved performance on final layer: F1 of 71.5
|
|
|
- but dramatically better performance on earlier layers <br>
|
|
|
→ Suggests that earlier layers carry more semantic information than one might expect
|
|
|
- This probe seems to be orthogonal to Hewitt and Manning's structural probe
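
The notes do not spell out the exact training objective, so the sketch below is one plausible reading: a linear probe trained so that, after the transform, different-sense pairs have low cosine similarity while same-sense pairs keep high similarity. The data, dimensions, and probe rank are all assumptions.

```python
import torch

# Sketch of a semantic probe: a linear map B trained so embeddings of
# different word senses become dissimilar while same-sense embeddings
# stay similar. The loss (cross-sense similarity minus within-sense
# similarity) is one plausible reading; tensors and the probe rank
# are stand-ins.
torch.manual_seed(0)
B = torch.nn.Linear(768, 128, bias=False)       # assumed probe rank 128
opt = torch.optim.Adam(B.parameters(), lr=1e-3)
cos = torch.nn.functional.cosine_similarity

same = torch.randn(32, 2, 768)  # stand-in pairs sharing a word sense
diff = torch.randn(32, 2, 768)  # stand-in pairs with different senses

for _ in range(100):
    loss = (cos(B(diff[:, 0]), B(diff[:, 1])).mean()
            - cos(B(same[:, 0]), B(same[:, 1])).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.3f}")
```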
|
|
|
|
|
|
**Conclusion**:
|
|
|
- The internal geometry of BERT may decompose into multiple linear subspaces, with separate spaces for different syntactic and semantic information
|
|
|
|