|
|
# Visualizing and Measuring the Geometry of BERT
|
|
|
|
|
|
## Preceding work:
|
|
|
This paper builds on [A Structural Probe for Finding Syntax in Word Representations](https://nlp.stanford.edu/pubs/hewitt2019structural.pdf) by John Hewitt and Christopher D. Manning. Its main findings were:
|
|
|
- There are geometric representations of entire dependency parse trees in BERT's activation space
|
|
|
- In Layer 16 (BERT-large)
|
|
|
- After applying a single global linear transformation (which they call the "structural probe")
|
|
|
- The squared distance between context embeddings is roughly proportional to the tree distance in the dependency parse (see the sketch after this list)
|
|
|
- Hewitt and Manning offered no explanation for why this squared-distance pattern arises
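
As a minimal sketch of the relation being probed for, the snippet below applies a (here untrained) linear map to stand-in activation vectors and computes the squared distance that, for a trained probe, approximates parse-tree distance. All dimensions and data are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Minimal sketch of the structural-probe relation (Hewitt & Manning):
# after a single global linear map B, the squared L2 distance between
# context embeddings should approximate parse-tree distance.
# Random vectors stand in for real BERT-large layer-16 activations,
# and B is untrained here; all dimensions are illustrative.
rng = np.random.default_rng(0)
hidden_dim, probe_rank, n_tokens = 1024, 64, 6

H = rng.normal(size=(n_tokens, hidden_dim))    # stand-in activations
B = rng.normal(size=(probe_rank, hidden_dim))  # the structural probe

def probed_squared_distance(i: int, j: int) -> float:
    """Squared distance after the probe; for a trained B this is
    roughly proportional to tree distance in the dependency parse."""
    diff = B @ (H[i] - H[j])
    return float(diff @ diff)

print(probed_squared_distance(0, 3))
```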
|
|
|
|
|
|
|
|
|
## Key findings:
|
|
|
- Evidence that BERT stores linguistic features in separate syntactic and semantic subspaces
|
|
|
- Two linear transformations were found, representing a syntactic and a semantic subspace respectively
|
|
|
- These appear to be orthogonal to each other
|
|
|
- BERT has a fine-grained geometric representation of word senses
|
|
|
- Different word senses form well-separated clusters
|
|
|
- In these clusters, context embeddings seem to encode additional fine-grained meaning
|
|
|
- Attention matrices seem to contain a decent amount of syntactic information
|
|
|
- Mathematical argument for geometry of representations found by Hewitt and Manning
|
|
|
- A Pythagorean embedding (power-2 embedding) may well explain the observed distances
|
|
|
|
|
|
## Probing the attention matrices:
|
|
|
- Task: Classifying the relation between two tokens
|
|
|
- Input: Model-wide attention vector
|
|
|
- Concatenation of the attention weights from all 12 heads in each of the 12 layers
|
|
|
![](./uploads/geometry_fig_1.png)
|
|
|
- Data: Penn Treebank
|
|
|
- 30 relations with 5,000 examples each
|
|
|
- Classifiers: two L2-regularized linear classifiers (see the sketch after this list)
|
|
|
- One binary classifier for whether a relation exists at all (85.5% accuracy)
|
|
|
- One multiclass classifier predicting the concrete relation (71.9% accuracy)
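
A minimal sketch of this probing setup, with random numbers standing in for real attention weights and Penn Treebank labels; the feature layout follows the description above, but every dimension and name here is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the attention probe. For a token pair (i, j), the input is
# the model-wide attention vector: the attention weight between i and j
# from every head in every layer (12 x 12 = 144 features). Random
# values stand in for real attention matrices and relation labels.
rng = np.random.default_rng(0)
n_pairs, n_layers, n_heads = 2000, 12, 12

X = rng.random(size=(n_pairs, n_layers * n_heads))  # stand-in attention vectors
y = rng.integers(0, 2, size=n_pairs)                # stand-in "relation exists" labels

# An L2-regularized linear classifier, as in the binary probe above.
clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X, y)
print(f"train accuracy: {clf.score(X, y):.3f}")
```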
|
|
|
|
|
|
|
|
|
|
|
|
## Simple mathematical explanation for the tree embeddings that Hewitt and Manning found:
|
|
|
Isometric tree embeddings into Euclidean space are **not** possible (equality in the triangle inequality forces all neighbors of a node onto a single line through its image, which fails as soon as a node has three neighbors) <br>
|
|
|
→ Look for other possible representations
|
|
|
|
|
|
**Power-p embeddings**:
|
|
|
- $`\lVert f(x) - f(y) \rVert^p = d(x,y)`$
|
|
|
- Pythagorean embedding (p = 2):
|
|
|
- Especially easy to construct: each edge simply steps in the direction of a fresh unit basis vector (see the sketch after this list)
|
|
|
![](./uploads/geometry_fig_5.png)
|
|
|
- Any tree with n nodes has a Pythagorean embedding into $`\mathbb{R}^{n-1}`$
|
|
|
- A Pythagorean embedding would explain the square-root distance pattern found by Hewitt and Manning
|
|
|
- Power-p embeddings with p < 2 do not always exist
|
|
|
- Power-p embeddings with p > 2 do exist, but lack an equally simple geometric construction
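
The sketch below builds the Pythagorean embedding for a small example tree and checks that the squared Euclidean distance equals tree distance for every node pair; the tree and helper functions are my own illustration of the construction.

```python
import numpy as np

# Pythagorean (power-2) embedding of a small tree into R^(n-1):
# the root sits at the origin, and each edge steps in the direction
# of its own unit basis vector, so a node's embedding is the sum of
# the basis vectors along its path to the root.
parent = {1: 0, 2: 0, 3: 1, 4: 1}   # child -> parent; node 0 is the root
n = len(parent) + 1                 # number of nodes

def embed(node: int) -> np.ndarray:
    vec = np.zeros(n - 1)
    while node != 0:                # walk up to the root
        vec[node - 1] = 1.0         # one basis vector per edge
        node = parent[node]
    return vec

def path_to_root(node: int) -> list:
    path = [node]
    while node != 0:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a: int, b: int) -> int:
    pa, pb = path_to_root(a), path_to_root(b)
    lca = next(x for x in pa if x in set(pb))  # lowest common ancestor
    return pa.index(lca) + pb.index(lca)

# Squared Euclidean distance matches tree distance for every pair.
for a in range(n):
    for b in range(n):
        assert np.isclose(np.sum((embed(a) - embed(b)) ** 2),
                          tree_distance(a, b))
print("squared distance == tree distance for all pairs")
```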
|
|
|
|
|
|
## Visualization of tree embeddings:
|
|
|
Comparison of BERT parse tree embeddings with exact power-2 embeddings
|
|
|
|
|
|
Input for visualization:
|
|
|
- Sentences from Penn Treebank with associated parse tree
|
|
|
- BERT-large token embeddings from layer 16 (as in Hewitt and Manning)
|
|
|
- transformed by the structural probe
|
|
|
- Dimensionality reduction via PCA (sketched below)
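
A compact sketch of the final step of this pipeline, with random vectors standing in for the probe-transformed token embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch of the final visualization step: probe-transformed token
# embeddings (random stand-ins here) are reduced to 2D with PCA so
# the parse tree can be drawn in the plane.
rng = np.random.default_rng(0)
probed = rng.normal(size=(20, 64))  # stand-in probed embeddings, one row per token
xy = PCA(n_components=2).fit_transform(probed)
print(xy.shape)                     # (20, 2): one 2D point per token
```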
|
|
|
|
|
|
![](./uploads/geometry_fig_2.png)
|
|
|
*Left images: Original parse tree. Right images: Parse tree from BERT embeddings.*
|
|
|
- dependencies have been drawn as edges to visualize the tree structure
|
|
|
- edge color indicates the deviation from the true tree distance
|
|
|
|
|
|
**Question**: Is the difference between these projected trees and the canonical ones merely noise or a more interesting pattern? <br>
|
|
|
→ The differences are systematic: the average embedding distance varies by dependency type (see the figure below), suggesting that BERT's syntactic representation has an additional quantitative aspect beyond traditional dependency grammar
|
|
|
|
|
|
![](./uploads/geometry_fig_3.png)
|
|
|
*The average squared edge length between two words with a given dependency.*
|
|
|
|
|
|
|
|
|
## Geometry of word senses:
|
|
|
|
|
|
**Question**: BERT seems to have several ways of representing syntactic information, but what about semantic features?
|
|
|
|
|
|
### Visualization of word senses:
|
|
|
Visualization tool:
|
|
|
- to be released to the public
|
|
|
- Setup:
|
|
|
- All sentences from introductions in English Wikipedia entries
|
|
|
- User enters a word; the system gathers 1,000 sentences containing it
|
|
|
- Context embeddings are taken from BERT-base at a layer of the user's choosing
|
|
|
- The 1,000 embeddings are visualized with UMAP (see the sketch after this list)
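
A minimal sketch of this setup, assuming the `umap-learn` package and using random vectors in place of real BERT-base context embeddings:

```python
import numpy as np
import umap  # pip install umap-learn

# Sketch of the word-sense visualization: gather ~1,000 context
# embeddings of one word (random stand-ins here; the tool pulls them
# from Wikipedia sentences and a user-chosen BERT-base layer), then
# project them to 2D with UMAP for plotting.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))  # stand-in BERT-base embeddings

points_2d = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
print(points_2d.shape)  # (1000, 2): one point per occurrence of the word
```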
|
|
|
|
|
|
![](./uploads/geometry_fig_4.png)
|
|
|
*Embeddings for the word "die" in different contexts, visualized with UMAP. Sample points are annotated with corresponding sentences. Overall annotations (blue text) are added as a guide.*
|
|
|
|
|
|
Findings:
|
|
|
- Well-separated clusters
|
|
|
- fine-grained meaning within clusters (e.g., separating sentences where a single person dies from those where multiple people die)
|
|
|
|
|
|
**Question:** Can we find quantitative corroboration that word senses are well-represented?
|
|
|
|
|
|
### Quantitative analysis of word sense disambiguation:
|
|
|
- Nearest-neighbor classification of word senses (sketched after this list)
|
|
|
- SemCor dataset
|
|
|
- A simple nearest-neighbor classifier achieves 71.1 F1 (above the previous state of the art)
|
|
|
- Accuracy increases monotonically through the layers
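
A minimal sketch of a centroid-based nearest-neighbor sense classifier in the spirit of the one above; the sense names and data are stand-ins, not SemCor.

```python
import numpy as np

# Sketch of nearest-neighbor word-sense classification: average the
# training embeddings of each sense into a centroid, then assign a new
# embedding the sense of the nearest centroid. Random vectors stand in
# for BERT embeddings; sense names are illustrative.
rng = np.random.default_rng(0)
train = {
    "sense_a": rng.normal(size=(50, 768)),
    "sense_b": rng.normal(size=(50, 768)) + 1.0,  # shifted so senses separate
}
centroids = {sense: embs.mean(axis=0) for sense, embs in train.items()}

def classify(embedding: np.ndarray) -> str:
    return min(centroids, key=lambda s: np.linalg.norm(embedding - centroids[s]))

print(classify(rng.normal(size=768) + 1.0))  # most likely "sense_b"
```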
|
|
|
|
|
|
New probe (linear transformation):
|
|
|
- trained to minimize the similarity between embeddings of different senses (a sketch of a plausible loss follows this list)
|
|
|
- Slightly improved performance on final layer: F1 of 71.5
|
|
|
- but dramatically better performance on earlier layers <br>
|
|
|
→ Suggests that earlier layers carry more semantic information than one might expect
|
|
|
- This probe seems to be orthogonal to Hewitt and Manning's structural probe
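
The notes do not spell out the exact training objective, so the sketch below is one plausible reading: a linear probe trained so that, after the transform, different-sense pairs have low cosine similarity while same-sense pairs keep high similarity. The data, dimensions, and probe rank are all assumptions.

```python
import torch

# Sketch of a semantic probe: a linear map B trained so embeddings of
# different word senses become dissimilar while same-sense embeddings
# stay similar. The loss (cross-sense similarity minus within-sense
# similarity) is one plausible reading; tensors and the probe rank
# are stand-ins.
torch.manual_seed(0)
B = torch.nn.Linear(768, 128, bias=False)       # assumed probe rank 128
opt = torch.optim.Adam(B.parameters(), lr=1e-3)
cos = torch.nn.functional.cosine_similarity

same = torch.randn(32, 2, 768)  # stand-in pairs sharing a word sense
diff = torch.randn(32, 2, 768)  # stand-in pairs with different senses

for _ in range(100):
    loss = (cos(B(diff[:, 0]), B(diff[:, 1])).mean()
            - cos(B(same[:, 0]), B(same[:, 1])).mean())
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.3f}")
```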
|
|
|
|
|
|
**Conclusion**:
|
|
|
- The internal geometry of BERT may decompose into multiple linear subspaces, with separate spaces for different syntactic and semantic information
|
|
|
|