Visualizing and Measuring the Geometry of BERT
Preceding work:
This paper builds on A Structural Probe for Finding Syntax in Word Representations by John Hewitt and Christopher D. Manning. Its main contributions were:
- There are geometric representations of entire dependency parse trees in BERT's activation space
- In Layer 16 (BERT-large)
- After applying a single global linear transformation (which they called structural probe)
- The squared distance between context embeddings is roughly proportional to the tree distance in the dependency parse
- The authors could not explain this observation
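The structural-probe idea above can be sketched in a few lines. This is an illustrative stand-in, not the trained probe: the hidden size matches BERT-large, but the probe rank and all values are assumed random placeholders.

```python
import numpy as np

# Sketch of the structural-probe distance (Hewitt & Manning): a single
# linear map B is applied to every context embedding, and the squared
# distance in the transformed space approximates parse-tree distance:
#   ||B(h_i - h_j)||^2  ~  d_tree(i, j)
rng = np.random.default_rng(0)
hidden, rank = 1024, 64                           # BERT-large hidden size; assumed probe rank
B = rng.standard_normal((rank, hidden)) * 0.01    # stand-in for a trained probe
h = rng.standard_normal((5, hidden))              # stand-in context embeddings

def probe_distance(h_i, h_j, B):
    diff = B @ (h_i - h_j)
    return float(diff @ diff)                     # squared distance in probe space

print(probe_distance(h[0], h[1], B))
```

In the actual probe, B is trained so that these squared distances match gold-parse tree distances; here it only shows the functional form.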
Key findings:
- Evidence that BERT stores linguistic features in separate syntactic and semantic subspaces
- Two linear transformations were found, representing a syntactic and a semantic subspace
- These appear to be orthogonal to each other
- BERT has a fine-grained geometric representation of word senses
- Different word senses build well-separated clusters
- In these clusters, context embeddings seem to encode additional fine-grained meaning
- Attention matrices seem to contain a decent amount of syntactic information
- Mathematical argument for geometry of representations found by Hewitt and Manning
- Pythagorean embedding (power-2 embedding) might very well be the explanation for the distances observed
Probing the attention matrices:
- Task: Classifying the relation between two tokens
- Input: Model-wide attention vector
- Data: Penn Treebank
- 30 relations with 5,000 examples each
- Classifier: two L2-regularized linear classifiers
- 1 for classifying the existence of a relation (85.5% accuracy)
- 1 for multiclass prediction of the concrete relation (71.9% accuracy)
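The binary probing task can be sketched with scikit-learn. The features here are random stand-ins for real model-wide attention vectors (384 dimensions assumes BERT-large's 24 layers × 16 heads); the labels are likewise synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the attention probe: one feature vector per token pair,
# built from the attention weight between the pair in every head of
# every layer, fed to an L2-regularized linear classifier.
rng = np.random.default_rng(0)
n_pairs, n_features = 500, 384               # assumed: 24 layers x 16 heads
X = rng.random((n_pairs, n_features))        # stand-in model-wide attention vectors
y = rng.integers(0, 2, size=n_pairs)         # 1 = a dependency relation exists

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X, y)
print(f"train accuracy: {clf.score(X, y):.3f}")
```

The multiclass probe for the concrete relation has the same shape, with `y` ranging over the 30 relation labels instead of {0, 1}.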
Simple mathematical explanation for the tree embeddings that Hewitt and Manning found:
Isometric tree embeddings into Euclidean space are not possible
→ Look for other possible representations
Power-p embeddings:
|| f(x) - f(y)||^p = d(x,y)
- Pythagorean embedding (p = 2): always exists for trees
- p < 2 does not necessarily exist
- p > 2 does exist but has no simple geometric explanation
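A Pythagorean embedding of a tree is easy to construct explicitly: give each edge its own orthogonal unit basis vector and embed each node as the sum of the vectors on its path to the root. Then the squared Euclidean distance between two nodes equals their tree distance. A minimal sketch:

```python
import numpy as np

def pythagorean_embedding(parent):
    """Embed a tree so squared Euclidean distance equals tree distance.

    parent[i] is the parent of node i; the root satisfies parent[r] == r.
    Each edge (i, parent[i]) gets its own axis i; a node's embedding is
    the indicator vector of the edges on its path to the root.
    """
    n = len(parent)
    emb = np.zeros((n, n))
    for i in range(n):
        node = i
        while parent[node] != node:      # walk up to the root
            emb[i, node] = 1.0           # mark edge (node, parent[node])
            node = parent[node]
    return emb

# Path graph 0-1-2-3 rooted at 0
parent = [0, 0, 1, 2]
emb = pythagorean_embedding(parent)
print(np.sum((emb[0] - emb[3]) ** 2))    # -> 3.0, the tree distance
```

Since the symmetric difference of two root paths is exactly the path between the two nodes, the squared distance counts one unit per edge on that path, which is the tree distance.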
Visualization of tree embeddings:
Comparison of BERT parse tree embeddings with correct power-2 embeddings
Input for visualization:
- Sentences from Penn Treebank with associated parse tree
- BERT-large token embeddings in layer 16 (analogous to Hewitt and Manning)
- transformed by "structural probe"
- Dimensionality reduction via PCA
Left images: Original parse tree. Right images: Parse tree from BERT embeddings.
- dependencies have been connected to visualize tree structure
- color of the edge indicates deviation from true tree distance
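The visualization pipeline above can be sketched as follows. The data is random stand-in for the probe-transformed embeddings, and the edge list is a hypothetical parse tree; only the PCA projection and the per-edge deviation computation are shown.

```python
import numpy as np
from sklearn.decomposition import PCA

# Sketch: project probe-transformed embeddings to 2D with PCA, then
# color each dependency edge by how far its embedded squared length
# deviates from the true tree distance (1 for adjacent words).
rng = np.random.default_rng(0)
points = rng.standard_normal((12, 64))            # stand-in transformed embeddings
xy = PCA(n_components=2).fit_transform(points)    # 2D coordinates for plotting

edges = [(i, i + 1) for i in range(11)]           # hypothetical parse-tree edges
deviation = [abs(float(np.sum((points[i] - points[j]) ** 2)) - 1.0)
             for i, j in edges]                   # drives the edge color
print(xy.shape, len(deviation))
```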
Question: Is the difference between these projected trees and the canonical ones merely noise or a more interesting pattern?
→ Such systematic differences suggest that BERT's syntactic representation has an additional quantitative aspect beyond traditional dependency grammar
Supporting observation: the average squared edge length between two words varies systematically with the dependency relation.
Geometry of word senses:
Question: BERT seems to have several ways of representing syntactic information, but what about semantic features?
Visualization of word senses:
Visualization tool:
- to be released to the public
- Setup:
- All sentences from introductions in English Wikipedia entries
- User enters a word, system gathers 1,000 sentences containing the word
- Context embeddings are taken from BERT-base, from a layer of the user's choosing
- Visualization of the 1,000 embeddings with UMAP
Embeddings for the word "die" in different contexts, visualized with UMAP. Sample points are annotated with corresponding sentences. Overall annotations (blue text) are added as a guide.
Findings:
- Well-separated clusters
- fine-grained meaning in clusters (single and multiple people die)
Question: Can we find quantitative corroboration that word senses are well represented?
Quantitative analysis of word sense disambiguation:
- Nearest-neighbor classification of word senses
- SemCor dataset
- Simple nearest-neighbor classifier: 71.1 F1 (higher than state of the art)
- Accuracy monotonically increasing through layers
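The nearest-neighbor classifier is conceptually simple: assign a query embedding the sense whose training centroid is closest. A minimal sketch with hypothetical senses and toy 2D embeddings (real context embeddings would be 768-dimensional BERT-base vectors):

```python
import numpy as np

def nn_word_sense(query_emb, sense_centroids):
    """Nearest-neighbor word-sense classifier: pick the sense whose
    centroid (mean of training context embeddings) is closest."""
    senses = list(sense_centroids)
    dists = [np.linalg.norm(query_emb - sense_centroids[s]) for s in senses]
    return senses[int(np.argmin(dists))]

# Toy example with two hypothetical senses of "bank"
centroids = {"river": np.array([0.0, 1.0]),
             "finance": np.array([1.0, 0.0])}
print(nn_word_sense(np.array([0.9, 0.1]), centroids))  # -> "finance"
```

Running this per layer and measuring F1 on held-out data gives the layer-wise accuracy curve described above.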
New probe (linear transformation):
- trained to minimize the similarity between embeddings of different senses
- Slightly improved performance on final layer: F1 of 71.5
- but dramatically better performance on earlier layers
→ Suggests more semantic information in earlier layers than one might expect
- This probe appears to be orthogonal to Hewitt and Manning's structural probe
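The training objective for such a sense probe can be sketched with a generic contrastive-style loss. This is an assumed stand-in, not the paper's exact objective: same-sense pairs are pulled together, different-sense pairs pushed apart up to a margin.

```python
import numpy as np

def sense_probe_loss(P, emb, labels, margin=1.0):
    """Toy contrastive-style objective for a linear sense probe P
    (assumed form, not the paper's actual loss)."""
    z = emb @ P.T                          # probe-transformed embeddings
    total, pairs = 0.0, 0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            d = np.linalg.norm(z[i] - z[j])
            if labels[i] == labels[j]:
                total += d ** 2                       # pull same sense together
            else:
                total += max(0.0, margin - d) ** 2    # push senses apart
            pairs += 1
    return total / pairs

rng = np.random.default_rng(0)
P = rng.standard_normal((8, 32)) * 0.1     # linear probe (assumed shape)
emb = rng.standard_normal((6, 32))         # stand-in context embeddings
labels = [0, 0, 0, 1, 1, 1]                # two hypothetical word senses
print(sense_probe_loss(P, emb, labels))
```

Minimizing a loss of this shape over P yields a linear transformation under which senses separate, matching the probe's described effect.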
Conclusion:
- The internal geometry of BERT may decompose into multiple linear subspaces, with separate spaces for different syntactic and semantic information